diff --git a/docs/blob-storage-content-streaming-workplan.md b/docs/blob-storage-content-streaming-workplan.md
new file mode 100644
index 0000000..6e9db7f
--- /dev/null
+++ b/docs/blob-storage-content-streaming-workplan.md
@@ -0,0 +1,88 @@
+# Blob Storage And Content Streaming Workplan
+
+Date: 2026-05-07
+
+Status: planned.
+
+## Purpose
+
+Add efficient, governed blob handling to `kontextual-engine` so source,
+normalized, and derived representations can reference real content bytes
+without duplicating storage. Expose those bytes through engine-native
+interfaces and CMIS content stream routes.
+
+## Current State
+
+The engine already records representation metadata:
+
+- digest,
+- size,
+- media type,
+- representation kind,
+- opaque `storage_ref`.
+
+It does not yet provide:
+
+- a content-addressed blob store,
+- deduplicating writes,
+- blob read/stream interfaces,
+- reference accounting or garbage collection,
+- CMIS byte-stream download semantics.
+
+## Target Architecture
+
+```text
+bytes
+  -> digest/size verification
+  -> BlobStoragePort
+  -> content-addressed adapter
+  -> AssetRepresentation storage_ref
+  -> governed representation service
+  -> service API / CMIS content stream
+```
+
+Core rule: blob storage is infrastructure. `AssetRepresentation` remains the
+engine-level content reference with provenance, digest, media type, kind,
+version, policy, and audit context.
+
+## Dedupe Strategy
+
+Use whole-object content-addressing first:
+
+- digest: `sha256:`,
+- storage key: digest-derived path or adapter key,
+- idempotent write: if digest already exists, reuse blob,
+- verify size and digest before returning a blob reference,
+- representation rows may duplicate references but must not duplicate bytes.
+
+Chunk-level deduplication is explicitly deferred until large-file evidence
+justifies the complexity.
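The dedupe strategy listed above can be sketched with an in-memory store, which also matches the goal of deterministic tests without a filesystem. This is a minimal sketch, not the planned implementation: `BlobRef`, `InMemoryBlobStore`, the `expected_digest`/`expected_size` parameters, and the `stored_bytes` helper are hypothetical names chosen for illustration, and the storage key is simply assumed to equal the digest string.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class BlobRef:
    """Hypothetical blob reference; field names are illustrative only."""
    digest: str       # "sha256:" followed by the hex digest
    size: int
    storage_key: str  # assumption: the digest doubles as the storage key
    adapter: str


class InMemoryBlobStore:
    """Toy content-addressed store for deterministic tests (no filesystem)."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put_bytes(self, data: bytes,
                  expected_digest: Optional[str] = None,
                  expected_size: Optional[int] = None) -> BlobRef:
        digest = "sha256:" + hashlib.sha256(data).hexdigest()
        # Verify caller-supplied digest and size before returning a reference.
        if expected_digest is not None and expected_digest != digest:
            raise ValueError("digest mismatch")
        if expected_size is not None and expected_size != len(data):
            raise ValueError("size mismatch")
        # Idempotent write: an existing digest keeps the bytes already stored.
        self._blobs.setdefault(digest, data)
        return BlobRef(digest, len(data), digest, "memory")

    def exists(self, digest: str) -> bool:
        return digest in self._blobs

    def stored_bytes(self) -> int:
        """Total bytes held, counting each distinct blob once."""
        return sum(len(blob) for blob in self._blobs.values())


store = InMemoryBlobStore()
first = store.put_bytes(b"hello blob")
second = store.put_bytes(b"hello blob")  # duplicate content, same reference
```

Two representation rows could hold `first` and `second` as separate references while the store keeps a single copy of the bytes, which is the property the last bullet of the strategy asks for.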
+
+## Interfaces
+
+Planned engine-native interfaces:
+
+- `BlobStoragePort.put_bytes(...)`
+- `BlobStoragePort.open_bytes(...)`
+- `BlobStoragePort.stat(...)`
+- `BlobStoragePort.exists(...)`
+- `BlobStoragePort.delete_unreferenced(...)`
+- `RepresentationContentService.add_representation_from_bytes(...)`
+- `RepresentationContentService.get_content_stream(...)`
+
+Planned CMIS integration:
+
+- `getContentStream` returns actual bytes/stream with content headers,
+- `setContentStream` stores through deduplicating representation service,
+- content stream changes produce versions and audit events,
+- descriptors remain available for clients that only need metadata.
+
+## Risks
+
+- Large files may require streaming APIs rather than in-memory bytes.
+- Local filesystem adapters need atomic writes and digest verification.
+- Garbage collection must never delete referenced blobs.
+- Security must treat blob bytes as governed content, not public storage.
+- Existing `storage_ref` values may point to external sources and should remain
+  valid as opaque references.
+
diff --git a/workplans/KONT-WP-0013-blob-storage-content-streaming.md b/workplans/KONT-WP-0013-blob-storage-content-streaming.md
new file mode 100644
index 0000000..67adb72
--- /dev/null
+++ b/workplans/KONT-WP-0013-blob-storage-content-streaming.md
@@ -0,0 +1,170 @@
+---
+id: KONT-WP-0013
+type: workplan
+title: "Blob Storage Deduplication And Content Streaming"
+domain: markitect
+repo: kontextual-engine
+status: active
+owner: codex
+topic_slug: markitect
+planning_priority: high
+planning_order: 13
+created: "2026-05-07"
+updated: "2026-05-07"
+state_hub_workstream_id: "21355091-1ebe-4662-983c-4795deea2adc"
+---
+
+# KONT-WP-0013: Blob Storage Deduplication And Content Streaming
+
+## Purpose
+
+Implement efficient blob handling and content streaming for
+`kontextual-engine`. The engine should store and expose representation bytes
+through governed, deduplicating interfaces while preserving the existing
+asset/representation/provenance model.
+
+## References
+
+- `docs/blob-storage-content-streaming-workplan.md`
+- `docs/architecture-blueprint.md`
+- `docs/cmis-deployment-compatibility.md`
+- `src/kontextual_engine/core/assets.py`
+- `src/kontextual_engine/core/cmis.py`
+
+## Boundary
+
+This workplan adds content-addressed blob infrastructure and stream interfaces.
+It does not introduce AtomPub, SOAP/Web Services, chunk-level deduplication, or
+a general document-management storage model.
+
+## Architecture Constraint
+
+Blob bytes are infrastructure state. Engine semantics remain attached to
+`AssetRepresentation`, `AssetVersion`, policy decisions, audit records,
+source references, and lineage. CMIS content streaming must delegate through
+engine-native content services instead of bypassing governance.
+
+## D13.1 - Define blob storage port and blob reference model
+
+```task
+id: KONT-WP-0013-T001
+status: todo
+priority: high
+state_hub_task_id: "6bb5b49a-cf9f-47ce-86d3-24b47a20a2c6"
+```
+
+Acceptance:
+
+- A stable blob storage port is defined for put, read/open, stat, exists, and
+  delete-unreferenced operations.
+- Blob references include digest, size, media type where appropriate, storage
+  key, and adapter name.
+- Port contracts support deterministic tests without requiring filesystem or
+  external object storage.
+
+## D13.2 - Implement content-addressed local blob adapter
+
+```task
+id: KONT-WP-0013-T002
+status: todo
+priority: high
+state_hub_task_id: "661386c7-8094-4f0f-928c-c17f5b3a9132"
+```
+
+Acceptance:
+
+- Local adapter stores blobs by digest-derived path.
+- Writes are idempotent and verify digest/size.
+- Duplicate content does not duplicate stored bytes.
+- Atomic write behavior is covered by tests where practical.
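The D13.2 acceptance points could be approached along the following lines. This is a sketch under stated assumptions rather than the adapter itself: the `sha256/<first two hex chars>/<hex digest>` layout and the function names `blob_path`, `put_bytes`, and `read_verified` are hypothetical. It relies on the standard pattern of writing to a temporary file in the target directory and publishing it with `os.replace`, so readers never see a half-written blob.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def blob_path(root: Path, hex_digest: str) -> Path:
    # Hypothetical layout: fan out on the first two hex characters.
    return root / "sha256" / hex_digest[:2] / hex_digest


def put_bytes(root: Path, data: bytes) -> Path:
    hex_digest = hashlib.sha256(data).hexdigest()
    target = blob_path(root, hex_digest)
    if target.exists():
        return target  # idempotent: duplicate content reuses the stored blob
    target.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives next to the target so the final rename stays on one
    # filesystem; os.replace then swaps it into place in a single step.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, target)
    except BaseException:
        # Clean up the partial temp file if the write or rename failed.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return target


def read_verified(root: Path, hex_digest: str) -> bytes:
    # Re-hash on read so silent corruption surfaces as an error.
    data = blob_path(root, hex_digest).read_bytes()
    if hashlib.sha256(data).hexdigest() != hex_digest:
        raise ValueError("stored blob does not match its digest")
    return data
```

A test can point `root` at a `tempfile.TemporaryDirectory()`, write the same payload twice, and assert that exactly one file exists under the root.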
+
+## D13.3 - Implement representation content service
+
+```task
+id: KONT-WP-0013-T003
+status: todo
+priority: high
+state_hub_task_id: "00bc34c5-0f79-47b6-b305-f47311edd3a7"
+```
+
+Acceptance:
+
+- Service creates `AssetRepresentation` records from bytes through blob storage.
+- Existing asset service mutation paths can delegate to the content service.
+- Content changes create versions and audit events.
+- Existing opaque `storage_ref` behavior remains compatible.
+
+## D13.4 - Add blob reference accounting and safe cleanup
+
+```task
+id: KONT-WP-0013-T004
+status: todo
+priority: medium
+state_hub_task_id: "cc4445d9-f773-4337-afd4-aeccc743dc1e"
+```
+
+Acceptance:
+
+- Referenced blob discovery is deterministic from representation records.
+- Cleanup can identify unreferenced blobs without deleting active content.
+- Dry-run cleanup reports reclaimable bytes and references.
+
+## D13.5 - Expose engine-native content stream interfaces
+
+```task
+id: KONT-WP-0013-T005
+status: todo
+priority: high
+state_hub_task_id: "db0e8a2d-50ce-439c-8393-d65e2fc4bc9e"
+```
+
+Acceptance:
+
+- Service/runtime methods can fetch representation bytes or stream handles.
+- Response metadata includes digest, size, media type, and representation ID.
+- Policy checks happen before bytes are exposed.
+- Tests cover source, normalized, and derived representation reads.
+
+## D13.6 - Integrate CMIS content stream byte semantics
+
+```task
+id: KONT-WP-0013-T006
+status: todo
+priority: high
+state_hub_task_id: "2f1da1fb-9634-4ba6-931a-3e29394efd37"
+```
+
+Acceptance:
+
+- CMIS `getContentStream` can return actual bytes/stream semantics when content
+  is locally available.
+- CMIS `setContentStream` writes through the deduplicating content service.
+- CMIS descriptors remain available for metadata-only clients.
+- Unsupported external/opaque storage refs produce structured diagnostics.
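The read-side criteria in D13.5 and D13.6 (policy before bytes, response metadata, structured diagnostics for opaque refs) might look roughly like this. Everything here is an illustrative assumption: the `Representation` dataclass stands in for `AssetRepresentation`, the `blob:` prefix is an invented marker for locally stored content, the diagnostic codes are made up, and a plain dict stands in for the blob storage port.

```python
import io
from dataclasses import dataclass
from typing import BinaryIO, Callable, Dict, Tuple

LOCAL_PREFIX = "blob:"  # assumed marker for locally stored blobs


@dataclass(frozen=True)
class Representation:
    """Illustrative stand-in for an AssetRepresentation record."""
    representation_id: str
    digest: str
    size: int
    media_type: str
    storage_ref: str


class ContentStreamError(Exception):
    """Carries a structured diagnostic instead of a bare message."""

    def __init__(self, code: str, representation_id: str, detail: str) -> None:
        super().__init__(f"{code}: {detail}")
        self.diagnostic = {
            "code": code,
            "representation_id": representation_id,
            "detail": detail,
        }


def get_content_stream(
    rep: Representation,
    blobs: Dict[str, bytes],
    is_allowed: Callable[[Representation], bool],
) -> Tuple[Dict[str, object], BinaryIO]:
    # Policy is evaluated before any bytes are looked up or exposed.
    if not is_allowed(rep):
        raise ContentStreamError("policy_denied", rep.representation_id,
                                 "read not permitted by policy")
    key = rep.storage_ref[len(LOCAL_PREFIX):]
    if not rep.storage_ref.startswith(LOCAL_PREFIX) or key not in blobs:
        # External or opaque refs stay valid but cannot be streamed here.
        raise ContentStreamError("content_not_local", rep.representation_id,
                                 "storage_ref is external or opaque")
    metadata = {
        "representation_id": rep.representation_id,
        "digest": rep.digest,
        "size": rep.size,
        "media_type": rep.media_type,
    }
    return metadata, io.BytesIO(blobs[key])
```

A CMIS `getContentStream` route could call a function of this shape and map `ContentStreamError.diagnostic` onto its own error response, which keeps the route from bypassing governance.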
+
+## D13.7 - Document deployment, migration, and capacity posture
+
+```task
+id: KONT-WP-0013-T007
+status: todo
+priority: medium
+state_hub_task_id: "987ad4f6-8658-4e93-82c2-b9fa0a3a2270"
+```
+
+Acceptance:
+
+- Blob root configuration and local adapter behavior are documented.
+- Existing representations with opaque `storage_ref` values have a migration
+  posture.
+- Capacity tests demonstrate dedupe effectiveness and avoid excessive fixture
+  size.
+- Operational cleanup guidance is documented.
+
+## Definition Of Done
+
+- Blob storage port and local content-addressed adapter exist.
+- Representation bytes can be stored, deduplicated, read, and governed.
+- CMIS content stream routes can expose real bytes when available.
+- Existing tests continue to pass.
+- Focused dedupe/content-stream tests cover duplicate content, readback,
+  policy denial, cleanup dry-run, and CMIS integration.
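The cleanup dry-run named in the Definition Of Done and in D13.4 can be reduced to a pure function, which keeps it deterministic and easy to test. This is only a sketch: `plan_cleanup` and the report keys are hypothetical, and it assumes blobs are identified by digest and that the referenced digests have already been collected from representation records.

```python
from typing import Dict, Iterable, List, Union

Report = Dict[str, Union[List[str], int]]


def plan_cleanup(blob_sizes: Dict[str, int],
                 referenced_digests: Iterable[str]) -> Report:
    """Report what a cleanup would reclaim without deleting anything."""
    referenced = set(referenced_digests)
    # Sorting keeps the report stable across runs for the same inputs.
    unreferenced = sorted(d for d in blob_sizes if d not in referenced)
    return {
        "referenced": sorted(referenced & blob_sizes.keys()),
        "unreferenced": unreferenced,
        "reclaimable_bytes": sum(blob_sizes[d] for d in unreferenced),
    }


report = plan_cleanup(
    {"sha256:aa": 10, "sha256:bb": 25, "sha256:cc": 7},
    ["sha256:aa", "sha256:cc", "sha256:aa"],  # duplicate references are fine
)
```

Because the function only reads its inputs, an actual delete step can be a separate operation that consumes the `unreferenced` list after the references have been re-checked.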