# Blob Storage And Content Streaming Workplan

Date: 2026-05-07

Status: implemented.

## Purpose

Add efficient, governed blob handling to `kontextual-engine` so source,
normalized, and derived representations can reference real content bytes
without duplicating storage. Expose those bytes through engine-native
interfaces and CMIS content stream routes.

## Implemented State

The engine records representation metadata:

- digest,
- size,
- media type,
- representation kind,
- `storage_ref`.

It now provides:

- a content-addressed blob storage port,
- in-memory, local filesystem, and optional S3 adapters,
- deduplicating writes by `sha256:<hex>` digest,
- whole-byte reads plus chunked `iter_bytes(...)` streaming,
- representation-level content service governance,
- reference accounting and dry-run/active cleanup,
- CMIS Browser Binding content stream byte routes.

## Target Architecture

```text
bytes
  -> digest/size verification
  -> BlobStoragePort
  -> content-addressed adapter (memory/local/S3)
  -> AssetRepresentation storage_ref
  -> governed representation service
  -> service API / CMIS content stream
```

Core rule: blob storage is infrastructure. `AssetRepresentation` remains the
engine-level content reference with provenance, digest, media type, kind,
version, policy, and audit context.

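The first step of the pipeline, digest and size verification, could look roughly like the sketch below. The helper names `compute_digest` and `verify_bytes` are hypothetical and are not taken from `kontextual-engine`; only the `sha256:<hex>` digest form comes from this document.

```python
import hashlib


def compute_digest(data: bytes) -> str:
    """Return a digest in the `sha256:<hex>` form used in this document."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def verify_bytes(data: bytes, expected_digest: str, expected_size: int) -> str:
    """Hypothetical helper: check size and digest before bytes reach a blob adapter.

    Raises ValueError on any mismatch and returns the verified digest otherwise.
    """
    if len(data) != expected_size:
        raise ValueError(f"size mismatch: expected {expected_size}, got {len(data)}")
    actual = compute_digest(data)
    if actual != expected_digest:
        raise ValueError(f"digest mismatch: expected {expected_digest}, got {actual}")
    return actual
```

Checking the size first is a cheap way to reject truncated uploads before paying for a hash computation.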
## Dedupe Strategy

Use whole-object content-addressing first:

- digest: `sha256:<hex>`,
- storage key: digest-derived path or adapter key,
- idempotent write: if digest already exists, reuse blob,
- verify size and digest before returning a blob reference,
- representation rows may duplicate references but must not duplicate bytes.

Chunk-level deduplication is explicitly deferred until large-file evidence
justifies the complexity.

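As an illustration of the whole-object strategy above, here is a minimal in-memory sketch of an idempotent, content-addressed write. `SketchMemoryBlobStore`, `BlobRef`, the `physical_writes` counter, and the `memory://` reference format are assumptions made for this example; they are not the engine's actual `InMemoryBlobStorage` API.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class BlobRef:
    """Hypothetical blob reference returned to the caller."""
    digest: str
    size: int
    storage_ref: str


class SketchMemoryBlobStore:
    """Illustrative content-addressed store keyed by `sha256:<hex>` digest."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}
        self.physical_writes = 0  # counts how often bytes were actually stored

    def put_bytes(self, data: bytes) -> BlobRef:
        digest = "sha256:" + hashlib.sha256(data).hexdigest()
        # Idempotent write: store the bytes only if the digest is new.
        if digest not in self._blobs:
            self._blobs[digest] = bytes(data)
            self.physical_writes += 1
        stored = self._blobs[digest]
        # Verify size and digest of what is stored before handing out a reference.
        stored_digest = "sha256:" + hashlib.sha256(stored).hexdigest()
        if len(stored) != len(data) or stored_digest != digest:
            raise ValueError(f"stored blob failed verification for {digest}")
        # The `memory://` scheme is an assumption for this sketch.
        return BlobRef(digest=digest, size=len(stored), storage_ref=f"memory://{digest}")
```

Two representations written with identical bytes receive equal references while the bytes are stored once, which is the "duplicate references, not bytes" rule from the list above.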
## Interfaces

Engine-native interfaces:

- `BlobStoragePort.put_bytes(...)`
- `BlobStoragePort.read_bytes(...)`
- `BlobStoragePort.iter_bytes(...)`
- `BlobStoragePort.stat(...)`
- `BlobStoragePort.exists(...)`
- `BlobStoragePort.delete_unreferenced(...)`
- `RepresentationContentService.add_representation_from_bytes(...)`
- `RepresentationContentService.get_content_stream(...)`
- `RepresentationContentService.stream_content(...)`

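The difference between a whole read and chunked iteration can be sketched as follows. The function `iter_chunks` and its `chunk_size` parameter are hypothetical; the real parameter lists of `read_bytes(...)` and `iter_bytes(...)` are not shown in this document and are not reproduced here.

```python
from collections.abc import Iterator


def iter_chunks(data: bytes, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield `data` in pieces of at most `chunk_size` bytes (illustrative only)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    view = memoryview(data)  # slicing a memoryview avoids copying until bytes() below
    for start in range(0, len(view), chunk_size):
        yield bytes(view[start:start + chunk_size])
```

A route handler can forward each chunk to the client as it arrives instead of materializing the whole object, which matters most for adapters that read from disk or object storage.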
CMIS integration:

- `getContentStream` returns actual bytes/stream with content headers,
- `setContentStream` stores through deduplicating representation service,
- `appendContentStream` composes the current stream plus appended bytes and
  stores the resulting representation through the same deduplicating service,
- content stream changes produce versions and audit events,
- descriptors remain available for clients that only need metadata.

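The set and append behaviour described above can be modelled in a few lines. `SketchDocument`, its methods, and the shared `blobs` dictionary are hypothetical stand-ins for illustration; they do not reflect the engine's CMIS route code, and audit events are left out.

```python
import hashlib


def digest_of(data: bytes) -> str:
    return "sha256:" + hashlib.sha256(data).hexdigest()


class SketchDocument:
    """Hypothetical document whose content stream changes create versions."""

    def __init__(self, blobs: dict[str, bytes]) -> None:
        self.blobs = blobs             # shared digest -> bytes map standing in for blob storage
        self.versions: list[str] = []  # digest of each content stream version, oldest first

    def set_content_stream(self, data: bytes) -> str:
        digest = digest_of(data)
        self.blobs.setdefault(digest, data)  # reuse the blob when the digest already exists
        self.versions.append(digest)         # every content change records a new version
        return digest

    def append_content_stream(self, appended: bytes) -> str:
        # Compose the current stream plus the appended bytes, then store the
        # result through the same deduplicating path as set_content_stream.
        current = self.blobs[self.versions[-1]] if self.versions else b""
        return self.set_content_stream(current + appended)
```

In this model an append always yields a new whole-object digest, so earlier versions keep pointing at their original bytes.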
## Storage Backends

- `InMemoryBlobStorage` supports deterministic unit tests and default runtime
  wiring.
- `LocalBlobStorage` stores content under digest-derived paths and uses atomic
  temporary writes.
- `S3BlobStorage` is available through the optional `kontextual-engine[s3]`
  extra and keeps S3 concerns behind the same blob port. It uses digest-derived
  object keys and streams object bodies in chunks.

The engine stores only the returned `storage_ref` on representations. Backend
selection is therefore a deployment concern, not a domain-model fork.

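A digest-derived path combined with an atomic temporary write, as described for the local filesystem backend, might be sketched like this. The two-character fan-out directory layout and the function names `blob_path` and `put_bytes_atomic` are assumptions; the actual layout used by `LocalBlobStorage` is not specified in this document.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def blob_path(root: Path, digest: str) -> Path:
    """Map `sha256:<hex>` to root/sha256/<first two hex chars>/<rest> (assumed layout)."""
    algo, hex_value = digest.split(":", 1)
    return root / algo / hex_value[:2] / hex_value[2:]


def put_bytes_atomic(root: Path, data: bytes) -> Path:
    """Write `data` under its digest-derived path without exposing partial files."""
    digest = "sha256:" + hashlib.sha256(data).hexdigest()
    target = blob_path(root, digest)
    if target.exists():
        return target  # idempotent: identical content is already stored
    target.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file in the target directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, target)  # atomic rename into the final location
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return target
```

Readers see either no file or the complete blob, because the final name only appears once the rename succeeds.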
## Migration Posture

Existing opaque `storage_ref` values remain valid metadata, but content bytes
can only be streamed when the configured blob adapter can resolve the reference.
Migration should import external content through
`RepresentationContentService.add_representation_from_bytes(...)` so dedupe,
digest verification, policy, versions, and audit events are preserved.

## Risks

- Very large files may require upload-side streaming beyond the current
  byte-based write API.
- Local filesystem adapters need atomic writes and digest verification.
- Garbage collection must never delete referenced blobs.
- Security must treat blob bytes as governed content, not public storage.
- Existing `storage_ref` values may point to external sources and should remain
  valid as opaque references.
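
The garbage-collection risk can be made concrete with a small sketch of reference-aware cleanup with a dry-run mode. The `cleanup` function, its `dry_run` flag, and the dictionary standing in for a blob adapter are hypothetical and do not mirror `BlobStoragePort.delete_unreferenced(...)`, whose parameters are not shown in this document.

```python
def cleanup(blobs: dict[str, bytes], referenced: set[str], dry_run: bool = True) -> list[str]:
    """Return unreferenced digests; delete them only when `dry_run` is False.

    `blobs` maps digest -> bytes and stands in for a blob adapter.
    `referenced` is the set of digests still used by any representation.
    """
    candidates = sorted(set(blobs) - referenced)
    if not dry_run:
        for digest in candidates:
            del blobs[digest]
    return candidates
```

Defaulting to `dry_run=True` means an operator has to opt in to deletion after reviewing the candidate list, and referenced digests are excluded from the candidates by construction.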