kontextual-engine/docs/blob-storage-content-streaming-workplan.md

# Blob Storage And Content Streaming Workplan

Date: 2026-05-07

Status: implemented.

## Purpose

Add efficient, governed blob handling to `kontextual-engine` so source,
normalized, and derived representations can reference real content bytes
without duplicating storage. Expose those bytes through engine-native
interfaces and CMIS content stream routes.

## Implemented State

The engine records representation metadata:

- digest,
- size,
- media type,
- representation kind,
- `storage_ref`.

It now provides:

- a content-addressed blob storage port,
- in-memory, local filesystem, and optional S3 adapters,
- deduplicating writes by `sha256:<hex>` digest,
- whole-byte reads plus chunked `iter_bytes(...)` streaming,
- representation-level content service governance,
- reference accounting and dry-run/active cleanup,
- CMIS Browser Binding content stream byte routes.

## Target Architecture

```text
bytes
  -> digest/size verification
  -> BlobStoragePort
  -> content-addressed adapter (memory/local/S3)
  -> AssetRepresentation storage_ref
  -> governed representation service
  -> service API / CMIS content stream
```

Core rule: blob storage is infrastructure. `AssetRepresentation` remains the
engine-level content reference with provenance, digest, media type, kind,
version, policy, and audit context.

## Dedupe Strategy

Use whole-object content-addressing first:

- digest: `sha256:<hex>`,
- storage key: digest-derived path or adapter key,
- idempotent write: if the digest already exists, reuse the existing blob,
- verify size and digest before returning a blob reference,
- representation rows may duplicate references but must not duplicate bytes.

Chunk-level deduplication is explicitly deferred until large-file evidence
justifies the complexity.
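
The whole-object strategy above can be sketched in a few lines of Python. This is a minimal illustration, not the engine's adapter: the class name `DedupingBlobStore`, the `expected_digest` and `expected_size` parameters, and the `blob_count` helper are all hypothetical.

```python
import hashlib
from typing import Dict, Optional


class DedupingBlobStore:
    """Hypothetical in-memory store keyed by `sha256:<hex>` digest."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put_bytes(
        self,
        data: bytes,
        expected_digest: Optional[str] = None,  # assumed parameter
        expected_size: Optional[int] = None,    # assumed parameter
    ) -> str:
        digest = "sha256:" + hashlib.sha256(data).hexdigest()
        # Verify size and digest before handing back a reference.
        if expected_digest is not None and expected_digest != digest:
            raise ValueError("digest mismatch")
        if expected_size is not None and expected_size != len(data):
            raise ValueError("size mismatch")
        # Idempotent write: an existing digest keeps its stored bytes.
        self._blobs.setdefault(digest, data)
        return digest

    def blob_count(self) -> int:
        return len(self._blobs)


store = DedupingBlobStore()
first_ref = store.put_bytes(b"same content")
second_ref = store.put_bytes(b"same content")
```

Both writes return the same reference and the store holds a single blob, so two representation rows can share one copy of the bytes.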

## Interfaces

Engine-native interfaces:

- `BlobStoragePort.put_bytes(...)`
- `BlobStoragePort.read_bytes(...)`
- `BlobStoragePort.iter_bytes(...)`
- `BlobStoragePort.stat(...)`
- `BlobStoragePort.exists(...)`
- `BlobStoragePort.delete_unreferenced(...)`
- `RepresentationContentService.add_representation_from_bytes(...)`
- `RepresentationContentService.get_content_stream(...)`
- `RepresentationContentService.stream_content(...)`
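
To make the shape of the port concrete, here is a hedged sketch using `typing.Protocol`. The method names come from the list above, but the parameter lists are elided in this document, so every signature below (including `chunk_size` and its default) is an assumption. `BlobStoragePortSketch` and `MemoryBlobs` are hypothetical names, and only a subset of the port is shown.

```python
import hashlib
from typing import Dict, Iterator, Protocol, runtime_checkable


@runtime_checkable
class BlobStoragePortSketch(Protocol):
    """Assumed signatures; the real port may differ."""

    def put_bytes(self, data: bytes) -> str: ...
    def read_bytes(self, storage_ref: str) -> bytes: ...
    def iter_bytes(self, storage_ref: str, chunk_size: int = 65536) -> Iterator[bytes]: ...
    def exists(self, storage_ref: str) -> bool: ...


class MemoryBlobs:
    """Toy adapter that satisfies the sketch structurally."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put_bytes(self, data: bytes) -> str:
        ref = "sha256:" + hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(ref, data)
        return ref

    def read_bytes(self, storage_ref: str) -> bytes:
        return self._blobs[storage_ref]

    def iter_bytes(self, storage_ref: str, chunk_size: int = 65536) -> Iterator[bytes]:
        data = self._blobs[storage_ref]
        # Yield fixed-size slices; the last chunk may be shorter.
        for start in range(0, len(data), chunk_size):
            yield data[start:start + chunk_size]

    def exists(self, storage_ref: str) -> bool:
        return storage_ref in self._blobs


blobs = MemoryBlobs()
ref = blobs.put_bytes(b"abcdefghij")
chunks = list(blobs.iter_bytes(ref, chunk_size=4))
```

Joining the chunks reproduces `read_bytes(ref)`, which is the property a streaming route relies on.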

CMIS integration:

- `getContentStream` returns actual bytes/stream with content headers,
- `setContentStream` stores through the deduplicating representation service,
- `appendContentStream` composes the current stream plus appended bytes and
  stores the resulting representation through the same deduplicating service,
- content stream changes produce versions and audit events,
- descriptors remain available for clients that only need metadata.
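
The `appendContentStream` behavior described above amounts to "read current, concatenate, store as a new content-addressed blob". The sketch below illustrates only that composition step; `append_content` and `digest_of` are hypothetical helpers, a plain dict stands in for the blob adapter, and version and audit handling are omitted.

```python
import hashlib
from typing import Dict


def digest_of(data: bytes) -> str:
    return "sha256:" + hashlib.sha256(data).hexdigest()


def append_content(blobs: Dict[str, bytes], current_ref: str, appended: bytes) -> str:
    """Compose current bytes plus appended bytes and store the result."""
    combined = blobs[current_ref] + appended
    new_ref = digest_of(combined)
    # Same deduplicating write path as any other content: reuse if present.
    blobs.setdefault(new_ref, combined)
    return new_ref


blobs: Dict[str, bytes] = {}
v1 = digest_of(b"hello ")
blobs[v1] = b"hello "
v2 = append_content(blobs, v1, b"world")
```

The earlier blob stays in place under its own digest, so a previous version that references it remains readable.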

## Storage Backends

- `InMemoryBlobStorage` supports deterministic unit tests and default runtime
  wiring.
- `LocalBlobStorage` stores content under digest-derived paths and uses atomic
  temporary writes.
- `S3BlobStorage` is available through the optional `kontextual-engine[s3]`
  extra and keeps S3 concerns behind the same blob port. It uses digest-derived
  object keys and streams object bodies in chunks.

The engine stores only the returned `storage_ref` on representations. Backend
selection is therefore a deployment concern, not a domain-model fork.
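
As an illustration of digest-derived paths with atomic temporary writes, the following sketch writes to a temp file in the target directory and then renames it into place with `os.replace`. The two-character fan-out layout and the function names `blob_path` and `put_bytes_atomic` are assumptions, not the layout `LocalBlobStorage` actually uses.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def blob_path(root: Path, digest: str) -> Path:
    # Assumed layout: <root>/sha256/<first two hex chars>/<full hex>
    algo, hex_digest = digest.split(":", 1)
    return root / algo / hex_digest[:2] / hex_digest


def put_bytes_atomic(root: Path, data: bytes) -> str:
    digest = "sha256:" + hashlib.sha256(data).hexdigest()
    target = blob_path(root, digest)
    if target.exists():
        return digest  # dedupe: bytes are already stored
    target.parent.mkdir(parents=True, exist_ok=True)
    # Temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, target)  # atomic rename into place
    except BaseException:
        # Clean up the partial temp file if anything failed before the rename.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
    return digest
```

Readers never observe a half-written blob under this scheme: the final path either does not exist or holds the complete bytes.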

## Migration Posture

Existing opaque `storage_ref` values remain valid metadata, but content bytes
can only be streamed when the configured blob adapter can resolve the reference.
Migration should import external content through
`RepresentationContentService.add_representation_from_bytes(...)` so dedupe,
digest verification, policy, versions, and audit events are preserved.

## Risks

- Very large files may require upload-side streaming beyond the current
  byte-based write API.
- Local filesystem adapters need atomic writes and digest verification.
- Garbage collection must never delete referenced blobs.
- Access control must treat blob bytes as governed content, not public storage.
- Existing `storage_ref` values may point to external sources and should remain
  valid as opaque references.
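
The garbage-collection risk above can be made concrete with a small sketch of reference-aware cleanup with a dry-run default. The function name `cleanup_unreferenced`, its parameters, and the dict standing in for a blob adapter are all hypothetical; the engine's reference accounting may work differently.

```python
from typing import Dict, Iterable, List


def cleanup_unreferenced(
    blobs: Dict[str, bytes],
    referenced: Iterable[str],
    dry_run: bool = True,
) -> List[str]:
    """Return unreferenced blob refs; delete them only when dry_run is False."""
    live = set(referenced)
    candidates = sorted(ref for ref in blobs if ref not in live)
    if not dry_run:
        for ref in candidates:
            del blobs[ref]
    return candidates


blobs = {"sha256:aa": b"kept", "sha256:bb": b"orphan"}
referenced = ["sha256:aa", "sha256:aa"]  # two rows sharing one blob

report = cleanup_unreferenced(blobs, referenced)           # dry run: report only
removed = cleanup_unreferenced(blobs, referenced, dry_run=False)
```

Running the dry run first lets an operator review the candidate list before any bytes are removed, and a blob referenced by any row is never a candidate.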