---
id: KONT-WP-0013
type: workplan
title: "Blob Storage Deduplication And Content Streaming"
domain: markitect
repo: kontextual-engine
status: completed
owner: codex
topic_slug: markitect
planning_priority: high
planning_order: 13
created: "2026-05-07"
updated: "2026-05-07"
state_hub_workstream_id: "21355091-1ebe-4662-983c-4795deea2adc"
---

# KONT-WP-0013: Blob Storage Deduplication And Content Streaming

## Purpose

Implement efficient blob handling and content streaming for `kontextual-engine`.
The engine should store and expose representation bytes through governed,
deduplicating interfaces while preserving the existing
asset/representation/provenance model.

## References

- `docs/blob-storage-content-streaming-workplan.md`
- `docs/architecture-blueprint.md`
- `docs/cmis-deployment-compatibility.md`
- `src/kontextual_engine/core/assets.py`
- `src/kontextual_engine/core/cmis.py`

## Boundary

This workplan adds content-addressed blob infrastructure and stream interfaces.
It does not introduce AtomPub, SOAP/Web Services, chunk-level deduplication, or a
general document-management storage model.

It includes an optional S3 backend as an infrastructure adapter behind the same
blob storage port. S3 object keys are digest-derived, so object storage can be
used without changing engine semantics or CMIS profile governance.

## Architecture Constraint

Blob bytes are infrastructure state. Engine semantics remain attached to
`AssetRepresentation`, `AssetVersion`, policy decisions, audit records, source
references, and lineage. CMIS content streaming must delegate through
engine-native content services instead of bypassing governance.

## D13.1 - Define blob storage port and blob reference model

```task
id: KONT-WP-0013-T001
status: done
priority: high
state_hub_task_id: "6bb5b49a-cf9f-47ce-86d3-24b47a20a2c6"
```

Acceptance:

- A stable blob storage port is defined for put, read/open, stat, exists, and delete-unreferenced operations.
- Blob references include digest, size, media type where appropriate, storage key, and adapter name.
- Port contracts support deterministic tests without requiring filesystem or external object storage.

## D13.2 - Implement content-addressed local blob adapter

```task
id: KONT-WP-0013-T002
status: done
priority: high
state_hub_task_id: "661386c7-8094-4f0f-928c-c17f5b3a9132"
```

Acceptance:

- Local adapter stores blobs by digest-derived path.
- Writes are idempotent and verify digest/size.
- Duplicate content does not duplicate stored bytes.
- Atomic write behavior is covered by tests where practical.

## D13.3 - Implement representation content service

```task
id: KONT-WP-0013-T003
status: done
priority: high
state_hub_task_id: "00bc34c5-0f79-47b6-b305-f47311edd3a7"
```

Acceptance:

- Service creates `AssetRepresentation` records from bytes through blob storage.
- Existing asset service mutation paths can delegate to the content service.
- Content changes create versions and audit events.
- Existing opaque `storage_ref` behavior remains compatible.

## D13.4 - Add blob reference accounting and safe cleanup

```task
id: KONT-WP-0013-T004
status: done
priority: medium
state_hub_task_id: "cc4445d9-f773-4337-afd4-aeccc743dc1e"
```

Acceptance:

- Referenced blob discovery is deterministic from representation records.
- Cleanup can identify unreferenced blobs without deleting active content.
- Dry-run cleanup reports reclaimable bytes and references.

## D13.5 - Expose engine-native content stream interfaces

```task
id: KONT-WP-0013-T005
status: done
priority: high
state_hub_task_id: "db0e8a2d-50ce-439c-8393-d65e2fc4bc9e"
```

Acceptance:

- Service/runtime methods can fetch representation bytes or stream handles.
- Response metadata includes digest, size, media type, and representation ID.
- Policy checks happen before bytes are exposed.
- Tests cover source, normalized, and derived representation reads.

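
The D13.5 acceptance criteria (policy check before exposure, response metadata carrying digest, size, media type, and representation ID) can be illustrated with a minimal sketch. Everything named here (`ContentService`, `Representation`, `ContentStream`, `PolicyDenied`, `open_stream`, the `allow` callback) is hypothetical and chosen for illustration; the engine's actual classes and signatures may differ.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, Iterator


class PolicyDenied(Exception):
    """Raised when the policy check rejects a read before any bytes are touched."""


@dataclass(frozen=True)
class Representation:
    # Hypothetical stand-in for the engine's AssetRepresentation record.
    representation_id: str
    kind: str  # "source", "normalized", or "derived"
    media_type: str
    digest: str
    size: int


@dataclass(frozen=True)
class ContentStream:
    # Response metadata plus a lazy iterator over the content chunks.
    representation_id: str
    digest: str
    size: int
    media_type: str
    chunks: Iterator[bytes]


class ContentService:
    """Toy governed content service backed by a digest-keyed dict (assumption)."""

    def __init__(self, allow: Callable[[str, Representation], bool]) -> None:
        self._allow = allow
        self._blobs: Dict[str, bytes] = {}
        self._representations: Dict[str, Representation] = {}

    def add(self, representation_id: str, kind: str, media_type: str,
            data: bytes) -> Representation:
        digest = hashlib.sha256(data).hexdigest()
        # Identical bytes share one stored blob, whatever the representation kind.
        self._blobs.setdefault(digest, data)
        rep = Representation(representation_id, kind, media_type, digest, len(data))
        self._representations[representation_id] = rep
        return rep

    def blob_count(self) -> int:
        return len(self._blobs)

    def open_stream(self, principal: str, representation_id: str,
                    chunk_size: int = 65536) -> ContentStream:
        rep = self._representations[representation_id]
        # The policy decision is made before the blob is looked up or read.
        if not self._allow(principal, rep):
            raise PolicyDenied(f"{principal} may not read {representation_id}")
        data = self._blobs[rep.digest]
        chunks = (data[i:i + chunk_size] for i in range(0, len(data), chunk_size))
        return ContentStream(rep.representation_id, rep.digest, rep.size,
                             rep.media_type, chunks)
```

In this sketch a denied read raises before the blob dictionary is consulted, so a caller never receives a partially populated stream handle. A production service would presumably also emit an audit record for both allowed and denied reads.
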
## D13.6 - Integrate CMIS content stream byte semantics

```task
id: KONT-WP-0013-T006
status: done
priority: high
state_hub_task_id: "2f1da1fb-9634-4ba6-931a-3e29394efd37"
```

Acceptance:

- CMIS `getContentStream` can return actual bytes/stream semantics when content is locally available.
- CMIS `setContentStream` writes through the deduplicating content service.
- CMIS descriptors remain available for metadata-only clients.
- Unsupported external/opaque storage refs produce structured diagnostics.

## D13.7 - Document deployment, migration, and capacity posture

```task
id: KONT-WP-0013-T007
status: done
priority: medium
state_hub_task_id: "987ad4f6-8658-4e93-82c2-b9fa0a3a2270"
```

Acceptance:

- Blob root configuration and local adapter behavior are documented.
- Existing representations with opaque `storage_ref` values have a migration posture.
- Capacity tests demonstrate dedupe effectiveness and avoid excessive fixture size.
- Operational cleanup guidance is documented.

## Definition Of Done

- Blob storage port and local content-addressed adapter exist.
- Representation bytes can be stored, deduplicated, read, and governed.
- CMIS content stream routes can expose real bytes when available.
- Existing tests continue to pass.
- Focused dedupe/content-stream tests cover duplicate content, readback, policy denial, cleanup dry-run, and CMIS integration.

## Completion Notes

- Implemented `BlobStorage` port with `put_bytes`, `read_bytes`, `iter_bytes`, `stat`, `exists`, and `delete_unreferenced`.
- Added in-memory, local filesystem, and optional S3 content-addressed adapters.
- Added governed representation content service for byte-backed representations, chunked streams, policy checks, audit events, and cleanup.
- Wired CMIS `setContentStream` and byte stream routes through the content service; repeated content updates now expose the latest source representation.
- Added tests for dedupe, local/S3 adapter behavior, content-kind reads, policy denial, cleanup dry-run, and CMIS stream integration.
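
For orientation, the port operations listed above can be sketched as a small in-memory content-addressed adapter. The method names come from the completion notes, but every signature, the `BlobRef` and `CleanupReport` shapes, the `sha256/<prefix>/<digest>` key layout, and the `dry_run` flag are assumptions made for this illustration, not the engine's actual API.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Optional, Tuple


@dataclass(frozen=True)
class BlobRef:
    # Fields follow the D13.1 acceptance list; the exact shape is assumed.
    digest: str
    size: int
    storage_key: str
    adapter: str
    media_type: Optional[str] = None


@dataclass(frozen=True)
class CleanupReport:
    dry_run: bool
    unreferenced_keys: Tuple[str, ...]
    reclaimable_bytes: int


class InMemoryBlobStorage:
    """Hypothetical in-memory adapter keyed by a digest-derived storage key."""

    adapter_name = "memory"

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    @staticmethod
    def _key_for(digest_hex: str) -> str:
        # Assumed layout: fan out on the first two hex characters.
        return f"sha256/{digest_hex[:2]}/{digest_hex}"

    def put_bytes(self, data: bytes, media_type: Optional[str] = None) -> BlobRef:
        digest_hex = hashlib.sha256(data).hexdigest()
        key = self._key_for(digest_hex)
        # Idempotent write: identical content maps to the same key and is stored once.
        self._objects.setdefault(key, bytes(data))
        return BlobRef(digest_hex, len(data), key, self.adapter_name, media_type)

    def exists(self, ref: BlobRef) -> bool:
        return ref.storage_key in self._objects

    def stat(self, ref: BlobRef) -> int:
        # Returns the stored size in bytes.
        return len(self._objects[ref.storage_key])

    def read_bytes(self, ref: BlobRef) -> bytes:
        data = self._objects[ref.storage_key]
        # Verify digest and size on readback.
        if len(data) != ref.size or hashlib.sha256(data).hexdigest() != ref.digest:
            raise ValueError(f"blob {ref.storage_key} failed verification")
        return data

    def iter_bytes(self, ref: BlobRef, chunk_size: int = 65536) -> Iterator[bytes]:
        data = self.read_bytes(ref)
        for start in range(0, len(data), chunk_size):
            yield data[start:start + chunk_size]

    def delete_unreferenced(self, referenced: Iterable[BlobRef],
                            dry_run: bool = True) -> CleanupReport:
        live = {ref.storage_key for ref in referenced}
        orphans = sorted(key for key in self._objects if key not in live)
        reclaimable = sum(len(self._objects[key]) for key in orphans)
        if not dry_run:
            for key in orphans:
                del self._objects[key]
        return CleanupReport(dry_run, tuple(orphans), reclaimable)
```

Defaulting `dry_run` to `True` is a deliberate choice in this sketch: the caller has to opt in to deletion, and the report of reclaimable bytes can be reviewed first.
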