blob deduplication and content streaming
docs/blob-storage-content-streaming-workplan.md
@@ -0,0 +1,88 @@
# Blob Storage And Content Streaming Workplan

Date: 2026-05-07

Status: planned.

## Purpose

Add efficient, governed blob handling to `kontextual-engine` so source,
normalized, and derived representations can reference real content bytes
without duplicating storage. Expose those bytes through engine-native
interfaces and CMIS content stream routes.

## Current State

The engine already records representation metadata:

- digest,
- size,
- media type,
- representation kind,
- opaque `storage_ref`.

It does not yet provide:

- a content-addressed blob store,
- deduplicating writes,
- blob read/stream interfaces,
- reference accounting or garbage collection,
- CMIS byte-stream download semantics.

## Target Architecture

```text
bytes
-> digest/size verification
-> BlobStoragePort
-> content-addressed adapter
-> AssetRepresentation storage_ref
-> governed representation service
-> service API / CMIS content stream
```

Core rule: blob storage is infrastructure. `AssetRepresentation` remains the
engine-level content reference with provenance, digest, media type, kind,
version, policy, and audit context.

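The first steps of the flow above (verify declared size and digest, then attach an opaque `storage_ref` to the representation) could look roughly like the sketch below. The field names, the `verify_bytes` and `build_representation` helpers, and the `blob://` reference format are assumptions made for illustration; the workplan does not define them.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class AssetRepresentation:
    """Engine-level content reference (illustrative subset of fields)."""
    digest: str       # "sha256:<hex>"
    size: int         # length in bytes
    media_type: str
    kind: str         # e.g. "source", "normalized", "derived"
    storage_ref: str  # opaque to the engine; only the adapter interprets it


def verify_bytes(data: bytes, declared_digest: str, declared_size: int) -> None:
    """Raise ValueError if the bytes do not match the declared size or digest."""
    if len(data) != declared_size:
        raise ValueError(f"size mismatch: declared {declared_size}, got {len(data)}")
    actual = "sha256:" + hashlib.sha256(data).hexdigest()
    if actual != declared_digest:
        raise ValueError(f"digest mismatch: declared {declared_digest}, got {actual}")


def build_representation(data: bytes, declared_digest: str, declared_size: int,
                         media_type: str, kind: str, storage_ref: str) -> AssetRepresentation:
    """Verify first, then hand back the engine-level reference."""
    verify_bytes(data, declared_digest, declared_size)
    return AssetRepresentation(declared_digest, declared_size, media_type, kind, storage_ref)
```

Keeping verification ahead of any reference creation means a representation row can only ever point at bytes whose digest and size were checked.
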
## Dedupe Strategy

Use whole-object content-addressing first:

- digest: `sha256:<hex>`,
- storage key: digest-derived path or adapter key,
- idempotent write: if digest already exists, reuse blob,
- verify size and digest before returning a blob reference,
- representation rows may duplicate references but must not duplicate bytes.

Chunk-level deduplication is explicitly deferred until large-file evidence
justifies the complexity.

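As a minimal sketch of the whole-object strategy, the toy in-memory store below keeps one copy of the bytes per distinct digest and treats repeated writes as no-ops. The class name, the sharded key layout, and the `expected_digest` parameter are hypothetical; a real adapter would persist to disk or object storage.

```python
import hashlib
from typing import Dict, Optional


class InMemoryContentAddressedStore:
    """Toy content-addressed store: bytes are keyed by their SHA-256 digest."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    @staticmethod
    def digest_of(data: bytes) -> str:
        return "sha256:" + hashlib.sha256(data).hexdigest()

    @staticmethod
    def storage_key(digest: str) -> str:
        # Digest-derived key, sharded by leading hex pairs (an assumed layout).
        algo, hex_digest = digest.split(":", 1)
        return f"{algo}/{hex_digest[:2]}/{hex_digest[2:4]}/{hex_digest}"

    def put_bytes(self, data: bytes, expected_digest: Optional[str] = None) -> str:
        digest = self.digest_of(data)
        # Verify before returning any reference to the caller.
        if expected_digest is not None and expected_digest != digest:
            raise ValueError("digest mismatch")
        # Idempotent write: an existing blob under this key is reused, not rewritten.
        self._blobs.setdefault(self.storage_key(digest), data)
        return digest

    def blob_count(self) -> int:
        return len(self._blobs)
```

Two representations created from identical bytes would receive the same digest and share one stored blob, which is the "duplicate references, not bytes" rule from the list above.
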
## Interfaces

Planned engine-native interfaces:

- `BlobStoragePort.put_bytes(...)`
- `BlobStoragePort.open_bytes(...)`
- `BlobStoragePort.stat(...)`
- `BlobStoragePort.exists(...)`
- `BlobStoragePort.delete_unreferenced(...)`
- `RepresentationContentService.add_representation_from_bytes(...)`
- `RepresentationContentService.get_content_stream(...)`

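One way the port could be expressed is as a `typing.Protocol`, so adapters satisfy it structurally. The workplan elides every signature (`...`), so all parameters and return types below, the `BlobStat` record, and the `DictBlobStore` adapter are assumptions for illustration only.

```python
import hashlib
import io
from dataclasses import dataclass
from typing import BinaryIO, Dict, Iterable, Protocol, runtime_checkable


@dataclass(frozen=True)
class BlobStat:
    digest: str
    size: int


@runtime_checkable
class BlobStoragePort(Protocol):
    # Signatures are guesses; the workplan only names the methods.
    def put_bytes(self, data: bytes) -> str: ...
    def open_bytes(self, digest: str) -> BinaryIO: ...
    def stat(self, digest: str) -> BlobStat: ...
    def exists(self, digest: str) -> bool: ...
    def delete_unreferenced(self, referenced: Iterable[str]) -> int: ...


class DictBlobStore:
    """Hypothetical in-memory adapter that satisfies the port structurally."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put_bytes(self, data: bytes) -> str:
        digest = "sha256:" + hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(digest, data)  # reuse an existing blob
        return digest

    def open_bytes(self, digest: str) -> BinaryIO:
        return io.BytesIO(self._blobs[digest])

    def stat(self, digest: str) -> BlobStat:
        return BlobStat(digest, len(self._blobs[digest]))

    def exists(self, digest: str) -> bool:
        return digest in self._blobs

    def delete_unreferenced(self, referenced: Iterable[str]) -> int:
        # Only blobs absent from the caller-supplied reference set are removed.
        keep = set(referenced)
        doomed = [d for d in self._blobs if d not in keep]
        for d in doomed:
            del self._blobs[d]
        return len(doomed)
```

Passing the live reference set into `delete_unreferenced` is one possible design; it keeps reference accounting in the engine and leaves the adapter unaware of representation rows.
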
Planned CMIS integration:

- `getContentStream` returns actual bytes/stream with content headers,
- `setContentStream` stores through deduplicating representation service,
- content stream changes produce versions and audit events,
- descriptors remain available for clients that only need metadata.

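To illustrate what "bytes/stream with content headers" might mean for `getContentStream`, the sketch below derives standard HTTP headers from representation metadata and yields the body in chunks. The function names, the choice of headers, and the use of the digest as an `ETag` are assumptions; the actual route handler and web framework are not specified in this plan.

```python
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_chunks(stream: BinaryIO, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield the stream in fixed-size chunks so the whole blob never sits in memory."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


def content_stream_response(stream: BinaryIO, *, media_type: str, size: int, digest: str,
                            chunk_size: int = 64 * 1024) -> Tuple[Dict[str, str], Iterator[bytes]]:
    """Build headers from representation metadata and pair them with a chunk iterator."""
    headers = {
        "Content-Type": media_type,
        "Content-Length": str(size),
        # A content digest is a natural strong validator for immutable blobs.
        "ETag": f'"{digest}"',
    }
    return headers, iter_chunks(stream, chunk_size)
```

The caller remains responsible for closing the stream once the iterator is exhausted, and for running policy checks before any bytes are opened.
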
## Risks

- Large files may require streaming APIs rather than in-memory bytes.
- Local filesystem adapters need atomic writes and digest verification.
- Garbage collection must never delete referenced blobs.
- Security must treat blob bytes as governed content, not public storage.
- Existing `storage_ref` values may point to external sources and should remain
  valid as opaque references.
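
The streaming and atomic-write risks can be addressed together on a local filesystem by hashing while writing to a temporary file and renaming it into place only after the digest checks out. This is a sketch under assumptions: the function name, directory layout, and chunk size are hypothetical, and only `sha256` digests are handled.

```python
import hashlib
import os
import tempfile
from pathlib import Path
from typing import BinaryIO


def write_blob_atomically(root: Path, stream: BinaryIO, expected_digest: str,
                          chunk_size: int = 1024 * 1024) -> Path:
    """Stream bytes to a temp file, verify the digest, then rename into place."""
    algo, _, expected_hex = expected_digest.partition(":")
    if algo != "sha256":
        raise ValueError("only sha256 digests are handled in this sketch")
    final_path = root / expected_hex[:2] / expected_hex
    if final_path.exists():
        return final_path  # blob already stored; the stream is left unread
    final_path.parent.mkdir(parents=True, exist_ok=True)
    hasher = hashlib.sha256()
    # Temp file lives in the target directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=final_path.parent, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in iter(lambda: stream.read(chunk_size), b""):
                hasher.update(chunk)
                tmp.write(chunk)
            tmp.flush()
            os.fsync(tmp.fileno())
        if hasher.hexdigest() != expected_hex:
            raise ValueError("digest mismatch; blob not stored")
        os.replace(tmp_name, final_path)  # atomic rename on the same filesystem
    finally:
        # After a successful rename the temp name is gone; otherwise clean it up.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
    return final_path
```

Readers never observe a partially written blob under its final name, and a failed verification leaves no file behind.
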