---
id: KONT-WP-0013
type: workplan
title: "Blob Storage Deduplication And Content Streaming"
domain: markitect
repo: kontextual-engine
status: completed
owner: codex
topic_slug: markitect
planning_priority: high
planning_order: 13
created: "2026-05-07"
updated: "2026-05-07"
state_hub_workstream_id: "21355091-1ebe-4662-983c-4795deea2adc"
---
# KONT-WP-0013: Blob Storage Deduplication And Content Streaming
## Purpose
Implement efficient blob handling and content streaming for
`kontextual-engine`. The engine should store and expose representation bytes
through governed, deduplicating interfaces while preserving the existing
asset/representation/provenance model.
## References
- `docs/blob-storage-content-streaming-workplan.md`
- `docs/architecture-blueprint.md`
- `docs/cmis-deployment-compatibility.md`
- `src/kontextual_engine/core/assets.py`
- `src/kontextual_engine/core/cmis.py`
## Boundary
This workplan adds content-addressed blob infrastructure and stream interfaces.
It does not introduce AtomPub, SOAP/Web Services, chunk-level deduplication, or
a general document-management storage model.
It includes an optional S3 backend as an infrastructure adapter behind the same
blob storage port. S3 object keys are digest-derived, so object storage can be
used without changing engine semantics or CMIS profile governance.
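
A minimal sketch of a digest-derived object key, assuming SHA-256 hex digests. The `s3_object_key` helper, the `blobs` prefix, and the two-character fan-out are hypothetical choices for illustration and are not taken from the adapter itself.

```python
import hashlib


def s3_object_key(digest: str, prefix: str = "blobs") -> str:
    """Build an object key from a content digest (layout is an assumption)."""
    if len(digest) != 64 or any(c not in "0123456789abcdef" for c in digest):
        raise ValueError("expected a lowercase hex SHA-256 digest")
    # The key depends only on the content, so identical bytes map to one object.
    return f"{prefix}/sha256/{digest[:2]}/{digest}"


digest = hashlib.sha256(b"same bytes").hexdigest()
key = s3_object_key(digest)
```

Because the key is a pure function of the digest, uploading the same content twice targets the same object, and nothing in the key carries asset or CMIS meaning.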
## Architecture Constraint
Blob bytes are infrastructure state. Engine semantics remain attached to
`AssetRepresentation`, `AssetVersion`, policy decisions, audit records,
source references, and lineage. CMIS content streaming must delegate through
engine-native content services instead of bypassing governance.
## D13.1 - Define blob storage port and blob reference model
```task
id: KONT-WP-0013-T001
status: done
priority: high
state_hub_task_id: "6bb5b49a-cf9f-47ce-86d3-24b47a20a2c6"
```
Acceptance:
- A stable blob storage port is defined for put, read/open, stat, exists, and
delete-unreferenced operations.
- Blob references include digest, size, media type where appropriate, storage
key, and adapter name.
- Port contracts support deterministic tests without requiring filesystem or
external object storage.
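
The port and reference model above could be sketched roughly as follows. The method names match those listed in the completion notes; the signatures, the `BlobRef` class name and field names, the SHA-256 choice, and the `InMemoryBlobStorage` class are assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Optional, Protocol


@dataclass(frozen=True)
class BlobRef:
    """Hypothetical reference shape; the real field names may differ."""

    digest: str  # hex SHA-256 of the content (hash choice is an assumption)
    size: int
    storage_key: str
    adapter: str
    media_type: Optional[str] = None


class BlobStorage(Protocol):
    """Port methods named in the completion notes; signatures are assumed."""

    def put_bytes(self, data: bytes, media_type: Optional[str] = None) -> BlobRef: ...
    def read_bytes(self, digest: str) -> bytes: ...
    def iter_bytes(self, digest: str, chunk_size: int = 65536) -> Iterator[bytes]: ...
    def stat(self, digest: str) -> BlobRef: ...
    def exists(self, digest: str) -> bool: ...
    def delete_unreferenced(
        self, referenced: Iterable[str], dry_run: bool = True
    ) -> List[str]: ...


class InMemoryBlobStorage:
    """Deterministic adapter for tests: no filesystem, no network."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def _ref(self, digest: str, media_type: Optional[str] = None) -> BlobRef:
        size = len(self._blobs[digest])
        return BlobRef(digest, size, f"sha256/{digest}", "memory", media_type)

    def put_bytes(self, data: bytes, media_type: Optional[str] = None) -> BlobRef:
        digest = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(digest, data)  # idempotent: same bytes, same key
        return self._ref(digest, media_type)

    def read_bytes(self, digest: str) -> bytes:
        return self._blobs[digest]

    def iter_bytes(self, digest: str, chunk_size: int = 65536) -> Iterator[bytes]:
        data = self._blobs[digest]
        for start in range(0, len(data), chunk_size):
            yield data[start : start + chunk_size]

    def stat(self, digest: str) -> BlobRef:
        return self._ref(digest)

    def exists(self, digest: str) -> bool:
        return digest in self._blobs

    def delete_unreferenced(
        self, referenced: Iterable[str], dry_run: bool = True
    ) -> List[str]:
        keep = set(referenced)
        doomed = sorted(d for d in self._blobs if d not in keep)
        if not dry_run:
            for digest in doomed:
                del self._blobs[digest]
        return doomed
```

An in-memory implementation like this lets contract tests run against the port without touching disk or object storage.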
## D13.2 - Implement content-addressed local blob adapter
```task
id: KONT-WP-0013-T002
status: done
priority: high
state_hub_task_id: "661386c7-8094-4f0f-928c-c17f5b3a9132"
```
Acceptance:
- Local adapter stores blobs by digest-derived path.
- Writes are idempotent and verify digest/size.
- Duplicate content does not duplicate stored bytes.
- Atomic write behavior is covered by tests where practical.
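
One way to satisfy these criteria is to write to a temporary file in the target directory and rename it into place. The sketch below assumes SHA-256 digests and a two-level fan-out directory layout; `blob_path` and `put_blob` are hypothetical names and do not describe the adapter's actual API.

```python
import hashlib
import os
import tempfile
from pathlib import Path
from typing import Optional


def blob_path(root: Path, digest: str) -> Path:
    # Fan out by leading hex pairs to keep directories small (assumed layout).
    return root / "sha256" / digest[:2] / digest[2:4] / digest


def put_blob(root: Path, data: bytes, expected_digest: Optional[str] = None) -> Path:
    """Store data under its digest; a repeated write of the same bytes is a no-op."""
    digest = hashlib.sha256(data).hexdigest()
    if expected_digest is not None and expected_digest != digest:
        raise ValueError("digest mismatch")
    target = blob_path(root, digest)
    if target.exists():
        # Already stored: verify size instead of rewriting the bytes.
        if target.stat().st_size != len(data):
            raise ValueError("stored blob has unexpected size")
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Rename within one directory, so readers never see a partial file.
        os.replace(tmp_name, target)
    finally:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
    return target
```

Creating the temporary file next to the target keeps the rename on one filesystem, which is what makes `os.replace` atomic on POSIX systems.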
## D13.3 - Implement representation content service
```task
id: KONT-WP-0013-T003
status: done
priority: high
state_hub_task_id: "00bc34c5-0f79-47b6-b305-f47311edd3a7"
```
Acceptance:
- Service creates `AssetRepresentation` records from bytes through blob storage.
- Existing asset service mutation paths can delegate to the content service.
- Content changes create versions and audit events.
- Existing opaque `storage_ref` behavior remains compatible.
## D13.4 - Add blob reference accounting and safe cleanup
```task
id: KONT-WP-0013-T004
status: done
priority: medium
state_hub_task_id: "cc4445d9-f773-4337-afd4-aeccc743dc1e"
```
Acceptance:
- Referenced blob discovery is deterministic from representation records.
- Cleanup can identify unreferenced blobs without deleting active content.
- Dry-run cleanup reports reclaimable bytes and references.
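
A dry-run plan of this kind can be computed as a set difference between stored digests and digests referenced by representation records. The `CleanupReport` shape and the `plan_cleanup` function below are hypothetical; the inputs are plain dictionaries and iterables standing in for the real stores.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class CleanupReport:
    """Hypothetical dry-run result; nothing is deleted when building it."""

    unreferenced: Tuple[str, ...]  # digests safe to reclaim, sorted
    reclaimable_bytes: int
    referenced_count: int  # stored blobs still in use


def plan_cleanup(
    stored_sizes: Dict[str, int], representation_digests: Iterable[str]
) -> CleanupReport:
    referenced = set(representation_digests)
    # Sorting keeps the report deterministic regardless of storage iteration order.
    unreferenced = tuple(sorted(d for d in stored_sizes if d not in referenced))
    reclaimable = sum(stored_sizes[d] for d in unreferenced)
    in_use = sum(1 for d in stored_sizes if d in referenced)
    return CleanupReport(unreferenced, reclaimable, in_use)


report = plan_cleanup(
    {"aa": 10, "bb": 20, "cc": 5},
    ["bb", "bb", "zz"],  # duplicates and unknown digests are tolerated
)
```

Here `report.unreferenced` is `("aa", "cc")` with 15 reclaimable bytes, and the blob referenced twice is counted once.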
## D13.5 - Expose engine-native content stream interfaces
```task
id: KONT-WP-0013-T005
status: done
priority: high
state_hub_task_id: "db0e8a2d-50ce-439c-8393-d65e2fc4bc9e"
```
Acceptance:
- Service/runtime methods can fetch representation bytes or stream handles.
- Response metadata includes digest, size, media type, and representation ID.
- Policy checks happen before bytes are exposed.
- Tests cover source, normalized, and derived representation reads.
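
The ordering requirement (policy first, bytes second) might look like the sketch below. `ContentStream`, `open_content`, the dictionary-shaped representation record, and the callable policy and reader are all assumptions for illustration, not the engine's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator


@dataclass(frozen=True)
class ContentStream:
    """Hypothetical response: metadata plus a chunk iterator."""

    representation_id: str
    digest: str
    size: int
    media_type: str
    chunks: Iterator[bytes]


def open_content(
    rep: Dict[str, str],
    actor: str,
    is_allowed: Callable[[str, str], bool],
    read_blob: Callable[[str], bytes],
    chunk_size: int = 65536,
) -> ContentStream:
    # Policy is evaluated before storage is touched, so a denial reads no bytes.
    if not is_allowed(actor, rep["id"]):
        raise PermissionError(f"{actor} may not read {rep['id']}")
    data = read_blob(rep["digest"])
    chunks = (data[i : i + chunk_size] for i in range(0, len(data), chunk_size))
    return ContentStream(rep["id"], rep["digest"], len(data), rep["media_type"], chunks)
```

A test can pass a `read_blob` that records its calls and assert that a denied request leaves the call log empty.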
## D13.6 - Integrate CMIS content stream byte semantics
```task
id: KONT-WP-0013-T006
status: done
priority: high
state_hub_task_id: "2f1da1fb-9634-4ba6-931a-3e29394efd37"
```
Acceptance:
- CMIS `getContentStream` can return actual bytes or a byte stream when content
is locally available.
- CMIS `setContentStream` writes through the deduplicating content service.
- CMIS descriptors remain available for metadata-only clients.
- Unsupported external/opaque storage refs produce structured diagnostics.
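
The last criterion can be illustrated with a resolver that returns either bytes or a structured diagnostic. The `blob:sha256:` reference scheme, the diagnostic codes, and the `resolve_content` function are hypothetical; the actual `storage_ref` format is not specified in this workplan.

```python
from dataclasses import dataclass
from typing import Dict, Union

BLOB_PREFIX = "blob:sha256:"  # hypothetical scheme for blob-backed refs


@dataclass(frozen=True)
class StreamDiagnostic:
    """Structured reason why bytes cannot be served (codes are assumptions)."""

    code: str
    message: str
    storage_ref: str


def resolve_content(
    storage_ref: str, blobs: Dict[str, bytes]
) -> Union[bytes, StreamDiagnostic]:
    if not storage_ref.startswith(BLOB_PREFIX):
        # Opaque or external refs stay valid as metadata but yield no bytes.
        return StreamDiagnostic(
            "unsupported_storage_ref",
            "storage_ref is not backed by the blob store",
            storage_ref,
        )
    digest = storage_ref[len(BLOB_PREFIX):]
    data = blobs.get(digest)
    if data is None:
        return StreamDiagnostic(
            "content_unavailable", "blob is not locally available", storage_ref
        )
    return data
```

Returning a diagnostic object rather than raising keeps metadata-only clients working while giving byte-oriented callers a machine-readable reason.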
## D13.7 - Document deployment, migration, and capacity posture
```task
id: KONT-WP-0013-T007
status: done
priority: medium
state_hub_task_id: "987ad4f6-8658-4e93-82c2-b9fa0a3a2270"
```
Acceptance:
- Blob root configuration and local adapter behavior are documented.
- Existing representations with opaque `storage_ref` values have a migration
posture.
- Capacity tests demonstrate dedupe effectiveness and avoid excessive fixture
size.
- Operational cleanup guidance is documented.
## Definition Of Done
- Blob storage port and local content-addressed adapter exist.
- Representation bytes can be stored, deduplicated, read, and governed.
- CMIS content stream routes can expose real bytes when available.
- Existing tests continue to pass.
- Focused dedupe/content-stream tests cover duplicate content, readback,
policy denial, cleanup dry-run, and CMIS integration.
## Completion Notes
- Implemented `BlobStorage` port with `put_bytes`, `read_bytes`, `iter_bytes`,
`stat`, `exists`, and `delete_unreferenced`.
- Added in-memory, local filesystem, and optional S3 content-addressed adapters.
- Added a governed representation content service covering byte-backed
representations, chunked streams, policy checks, audit events, and cleanup.
- Wired CMIS `setContentStream` and byte stream routes through the content
service; repeated content updates now expose the latest source representation.
- Added tests for dedupe, local/S3 adapter behavior, content-kind reads, policy
denial, cleanup dry-run, and CMIS stream integration.