content-addressed blob storage: blob_storage.py, memory, local, and S3 adapters
@@ -190,7 +190,7 @@ Required MVP ports:

- Repository port for assets, representations, metadata, relationships,
versions, runs, audit events, and exports.
- Object/content store port for source, normalized, and derived content payloads.
- Blob/content store port for source, normalized, and derived content payloads.
- Search index port for lexical search and later semantic/hybrid retrieval.
- Extractor port for format-specific normalization.
- Connector port for source systems.
@@ -211,6 +211,10 @@ Adapter rules:
Markitect where useful, but they are not the canonical engine identity or
storage model. The canonical layer remains asset, representation, metadata,
lifecycle, policy, lineage, and audit state.
- Blob storage is infrastructure behind `AssetRepresentation.storage_ref`.
Whole-object content addressing, digest verification, and chunked byte
streaming belong behind the blob port. Local filesystem and S3 are adapters,
not different domain models.
- `llm-connect` or equivalent is an adapter for LLM providers.
- `phase-memory` is an adjacent memory runtime; this engine may exchange opaque
memory references or context packages but should not implement memory phases.
@@ -251,6 +255,9 @@ Recommended storage style:
adapter-specific payloads.
- Separate content/object references for large source, normalized, or derived
payloads.
- Store blob bytes outside repository rows when content is non-trivial. Keep
representation digest, size, media type, kind, producer, and storage ref in
the repository, and let blob adapters handle byte persistence and dedupe.
- Append-only audit events and change records.
- Deterministic ordering fields for pagination and tests.

@@ -2,7 +2,7 @@

Date: 2026-05-07

Status: planned.
Status: implemented.

## Purpose

@@ -11,23 +11,25 @@ normalized, and derived representations can reference real content bytes
without duplicating storage. Expose those bytes through engine-native
interfaces and CMIS content stream routes.

## Current State
## Implemented State

The engine already records representation metadata:
The engine records representation metadata:

- digest,
- size,
- media type,
- representation kind,
- opaque `storage_ref`.
- `storage_ref`.

It does not yet provide:
It now provides:

- a content-addressed blob store,
- deduplicating writes,
- blob read/stream interfaces,
- reference accounting or garbage collection,
- CMIS byte-stream download semantics.
- a content-addressed blob storage port,
- in-memory, local filesystem, and optional S3 adapters,
- deduplicating writes by `sha256:<hex>` digest,
- whole-byte reads plus chunked `iter_bytes(...)` streaming,
- representation-level content service governance,
- reference accounting and dry-run/active cleanup,
- CMIS Browser Binding content stream byte routes.
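
As a rough illustration of what "deduplicating writes by `sha256:<hex>` digest" and chunked `iter_bytes(...)` streaming could look like, here is a minimal in-memory sketch. It is not the code from this commit: the class name `InMemoryBlobSketch`, the `expected_digest` and `chunk_size` parameters, and the choice to return the digest string as the reference are all assumptions made for the example.

```python
import hashlib
from typing import Dict, Iterator, Optional


class InMemoryBlobSketch:
    """Hypothetical content-addressed store; not the engine's InMemoryBlobStorage."""

    def __init__(self) -> None:
        # Maps "sha256:<hex>" digests to the stored bytes.
        self._blobs: Dict[str, bytes] = {}

    def put_bytes(self, data: bytes, expected_digest: Optional[str] = None) -> str:
        digest = "sha256:" + hashlib.sha256(data).hexdigest()
        # Digest verification: reject content that does not match the caller's claim.
        if expected_digest is not None and expected_digest != digest:
            raise ValueError(f"digest mismatch: expected {expected_digest}, got {digest}")
        # Deduplication: identical content maps to the same key and is stored once.
        self._blobs.setdefault(digest, data)
        return digest

    def read_bytes(self, digest: str) -> bytes:
        # Whole-object read.
        return self._blobs[digest]

    def iter_bytes(self, digest: str, chunk_size: int = 65536) -> Iterator[bytes]:
        # Chunked read: yield fixed-size slices so callers never need one big buffer.
        data = self._blobs[digest]
        for start in range(0, len(data), chunk_size):
            yield data[start:start + chunk_size]
```

Writing the same payload twice returns the same reference and leaves a single stored copy, which is the property the repository relies on when several representations share content.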

## Target Architecture

@@ -35,7 +37,7 @@ It does not yet provide:
bytes
-> digest/size verification
-> BlobStoragePort
-> content-addressed adapter
-> content-addressed adapter (memory/local/S3)
-> AssetRepresentation storage_ref
-> governed representation service
-> service API / CMIS content stream
@@ -60,29 +62,52 @@ justifies the complexity.

## Interfaces

Planned engine-native interfaces:
Engine-native interfaces:

- `BlobStoragePort.put_bytes(...)`
- `BlobStoragePort.open_bytes(...)`
- `BlobStoragePort.read_bytes(...)`
- `BlobStoragePort.iter_bytes(...)`
- `BlobStoragePort.stat(...)`
- `BlobStoragePort.exists(...)`
- `BlobStoragePort.delete_unreferenced(...)`
- `RepresentationContentService.add_representation_from_bytes(...)`
- `RepresentationContentService.get_content_stream(...)`
- `RepresentationContentService.stream_content(...)`

Planned CMIS integration:
CMIS integration:

- `getContentStream` returns actual bytes/stream with content headers,
- `setContentStream` stores through deduplicating representation service,
- content stream changes produce versions and audit events,
- descriptors remain available for clients that only need metadata.

## Storage Backends

- `InMemoryBlobStorage` supports deterministic unit tests and default runtime
wiring.
- `LocalBlobStorage` stores content under digest-derived paths and uses atomic
temporary writes.
- `S3BlobStorage` is available through the optional `kontextual-engine[s3]`
extra and keeps S3 concerns behind the same blob port. It uses digest-derived
object keys and streams object bodies in chunks.

The engine stores only the returned `storage_ref` on representations. Backend
selection is therefore a deployment concern, not a domain-model fork.
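
To make "digest-derived paths" and "atomic temporary writes" concrete, the following sketch shows one common way a local filesystem adapter might do both. It is an assumption-laden example, not the `LocalBlobStorage` from this commit: the class name `LocalBlobSketch`, the two-level fan-out directory layout, and returning the digest as the reference are all hypothetical.

```python
import hashlib
import os
import tempfile
from pathlib import Path
from typing import Iterator


class LocalBlobSketch:
    """Hypothetical local adapter; directory layout and names are assumptions."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)

    def _path_for(self, digest: str) -> Path:
        algo, _, hex_digest = digest.partition(":")
        # Assumed layout: <root>/sha256/ab/cd/<full hex digest>
        return self.root / algo / hex_digest[:2] / hex_digest[2:4] / hex_digest

    def put_bytes(self, data: bytes) -> str:
        digest = "sha256:" + hashlib.sha256(data).hexdigest()
        target = self._path_for(digest)
        if target.exists():
            return digest  # Same content already stored: nothing to write.
        target.parent.mkdir(parents=True, exist_ok=True)
        # Write to a temp file in the same directory, then rename into place,
        # so readers never observe a partially written blob.
        fd, tmp_name = tempfile.mkstemp(dir=target.parent, prefix=".tmp-")
        try:
            with os.fdopen(fd, "wb") as handle:
                handle.write(data)
                handle.flush()
                os.fsync(handle.fileno())
            os.replace(tmp_name, target)
        except BaseException:
            if os.path.exists(tmp_name):
                os.unlink(tmp_name)
            raise
        return digest

    def iter_bytes(self, digest: str, chunk_size: int = 65536) -> Iterator[bytes]:
        # Stream the file in chunks instead of loading it whole.
        with self._path_for(digest).open("rb") as handle:
            while True:
                chunk = handle.read(chunk_size)
                if not chunk:
                    break
                yield chunk
```

Keeping the temporary file in the target directory matters because `os.replace` is only atomic when source and destination are on the same filesystem.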

## Migration Posture

Existing opaque `storage_ref` values remain valid metadata, but content bytes
can only be streamed when the configured blob adapter can resolve the reference.
Migration should import external content through
`RepresentationContentService.add_representation_from_bytes(...)` so dedupe,
digest verification, policy, versions, and audit events are preserved.

## Risks

- Large files may require streaming APIs rather than in-memory bytes.
- Very large files may require upload-side streaming beyond the current
byte-based write API.
- Local filesystem adapters need atomic writes and digest verification.
- Garbage collection must never delete referenced blobs.
- Security must treat blob bytes as governed content, not public storage.
- Existing `storage_ref` values may point to external sources and should remain
valid as opaque references.
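
The garbage-collection risk above pairs with the "dry-run/active cleanup" mentioned earlier. A minimal sketch of that idea follows, assuming blobs are keyed by digest and the caller can supply the set of digests still referenced by representations. The function name `sweep_unreferenced` and its parameters are hypothetical and do not reflect the actual signature of `BlobStoragePort.delete_unreferenced(...)`.

```python
from typing import Dict, Iterable, List, Set


def sweep_unreferenced(
    blobs: Dict[str, bytes],
    referenced: Iterable[str],
    dry_run: bool = True,
) -> List[str]:
    """Return digests with no live reference; delete them only when dry_run is False."""
    live: Set[str] = set(referenced)
    # Only blobs absent from the live set are candidates, so a referenced
    # blob can never be selected for deletion.
    candidates = sorted(digest for digest in blobs if digest not in live)
    if not dry_run:
        for digest in candidates:
            del blobs[digest]
    return candidates
```

Defaulting to `dry_run=True` means an operator sees what would be removed before anything is deleted, which is one way to lower the chance of dropping a blob that is still in use.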