blob deduplication and content streaming
88
docs/blob-storage-content-streaming-workplan.md
Normal file
@@ -0,0 +1,88 @@
# Blob Storage And Content Streaming Workplan

Date: 2026-05-07

Status: planned.

## Purpose

Add efficient, governed blob handling to `kontextual-engine` so source,
normalized, and derived representations can reference real content bytes
without duplicating storage. Expose those bytes through engine-native
interfaces and CMIS content stream routes.

## Current State

The engine already records representation metadata:

- digest,
- size,
- media type,
- representation kind,
- opaque `storage_ref`.

It does not yet provide:

- a content-addressed blob store,
- deduplicating writes,
- blob read/stream interfaces,
- reference accounting or garbage collection,
- CMIS byte-stream download semantics.

## Target Architecture

```text
bytes
-> digest/size verification
-> BlobStoragePort
-> content-addressed adapter
-> AssetRepresentation storage_ref
-> governed representation service
-> service API / CMIS content stream
```

Core rule: blob storage is infrastructure. `AssetRepresentation` remains the
engine-level content reference with provenance, digest, media type, kind,
version, policy, and audit context.

## Dedupe Strategy

Use whole-object content-addressing first:

- digest: `sha256:<hex>`,
- storage key: digest-derived path or adapter key,
- idempotent write: if digest already exists, reuse blob,
- verify size and digest before returning a blob reference,
- representation rows may duplicate references but must not duplicate bytes.

Chunk-level deduplication is explicitly deferred until large-file evidence
justifies the complexity.

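The whole-object strategy above can be sketched as follows. This is a minimal illustration, assuming an in-memory dict stands in for the real adapter; `InMemoryBlobStore`, `expected_digest`, and `stored_bytes` are hypothetical names and not part of the planned API.

```python
import hashlib
from typing import Dict, Optional


class InMemoryBlobStore:
    """Hypothetical whole-object content-addressed store, for illustration only."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put_bytes(self, data: bytes, expected_digest: Optional[str] = None) -> str:
        # Digest format follows the workplan: "sha256:<hex>".
        digest = "sha256:" + hashlib.sha256(data).hexdigest()
        # Verify before returning a reference (the caller-supplied digest is assumed).
        if expected_digest is not None and expected_digest != digest:
            raise ValueError("digest mismatch")
        # Idempotent write: keep the existing blob if the digest is already known.
        self._blobs.setdefault(digest, data)
        return digest

    def stored_bytes(self) -> int:
        return sum(len(blob) for blob in self._blobs.values())


store = InMemoryBlobStore()
ref_a = store.put_bytes(b"same content")
ref_b = store.put_bytes(b"same content")  # second representation, same bytes
```

Both calls return the same reference, and the store holds the payload once, so two representation rows can point at one blob.
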
## Interfaces

Planned engine-native interfaces:

- `BlobStoragePort.put_bytes(...)`
- `BlobStoragePort.open_bytes(...)`
- `BlobStoragePort.stat(...)`
- `BlobStoragePort.exists(...)`
- `BlobStoragePort.delete_unreferenced(...)`
- `RepresentationContentService.add_representation_from_bytes(...)`
- `RepresentationContentService.get_content_stream(...)`

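One possible shape for the storage port is a structural `typing.Protocol`, sketched below. The method names come from the list above; every parameter, default, and return type is an assumption, since the workplan leaves the signatures as `(...)`.

```python
from typing import BinaryIO, Dict, List, Optional, Protocol, Set, runtime_checkable


@runtime_checkable
class BlobStoragePort(Protocol):
    """Method names are from the workplan; all signatures here are assumed."""

    def put_bytes(self, data: bytes, media_type: Optional[str] = None) -> str:
        """Store bytes and return a digest such as "sha256:<hex>"."""
        ...

    def open_bytes(self, digest: str) -> BinaryIO:
        """Return a readable binary stream for a stored blob."""
        ...

    def stat(self, digest: str) -> Dict[str, object]:
        """Return blob metadata, for example size and storage key."""
        ...

    def exists(self, digest: str) -> bool:
        ...

    def delete_unreferenced(self, referenced: Set[str], dry_run: bool = True) -> List[str]:
        """Return the digests that were (or would be) removed."""
        ...
```

A protocol keeps adapters decoupled from a base class: an in-memory test double and a filesystem adapter both satisfy the port as long as they provide these methods.
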
Planned CMIS integration:

- `getContentStream` returns actual bytes/stream with content headers,
- `setContentStream` stores through deduplicating representation service,
- content stream changes produce versions and audit events,
- descriptors remain available for clients that only need metadata.

## Risks

- Large files may require streaming APIs rather than in-memory bytes.
- Local filesystem adapters need atomic writes and digest verification.
- Garbage collection must never delete referenced blobs.
- Security must treat blob bytes as governed content, not public storage.
- Existing `storage_ref` values may point to external sources and should remain
  valid as opaque references.

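For the large-file risk, digest and size can be computed incrementally so that only one chunk is held in memory at a time. The sketch below assumes a readable binary stream; `digest_stream` and the 1 MiB `CHUNK_SIZE` are illustrative choices, not names or values from the workplan.

```python
import hashlib
import io
from typing import BinaryIO, Tuple

CHUNK_SIZE = 1024 * 1024  # assumed 1 MiB read size; tune per deployment


def digest_stream(stream: BinaryIO, chunk_size: int = CHUNK_SIZE) -> Tuple[str, int]:
    """Return ("sha256:<hex>", size) while reading one chunk at a time."""
    hasher = hashlib.sha256()
    size = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        hasher.update(chunk)
        size += len(chunk)
    return "sha256:" + hasher.hexdigest(), size


# A tiny chunk size forces several reads over a ten-byte payload.
digest, size = digest_stream(io.BytesIO(b"x" * 10), chunk_size=4)
```

The result matches hashing the whole payload at once, which lets a streaming write path produce the same blob reference as an in-memory one.
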
170
workplans/KONT-WP-0013-blob-storage-content-streaming.md
Normal file
@@ -0,0 +1,170 @@
---
id: KONT-WP-0013
type: workplan
title: "Blob Storage Deduplication And Content Streaming"
domain: markitect
repo: kontextual-engine
status: active
owner: codex
topic_slug: markitect
planning_priority: high
planning_order: 13
created: "2026-05-07"
updated: "2026-05-07"
state_hub_workstream_id: "21355091-1ebe-4662-983c-4795deea2adc"
---

# KONT-WP-0013: Blob Storage Deduplication And Content Streaming

## Purpose

Implement efficient blob handling and content streaming for
`kontextual-engine`. The engine should store and expose representation bytes
through governed, deduplicating interfaces while preserving the existing
asset/representation/provenance model.

## References

- `docs/blob-storage-content-streaming-workplan.md`
- `docs/architecture-blueprint.md`
- `docs/cmis-deployment-compatibility.md`
- `src/kontextual_engine/core/assets.py`
- `src/kontextual_engine/core/cmis.py`

## Boundary

This workplan adds content-addressed blob infrastructure and stream interfaces.
It does not introduce AtomPub, SOAP/Web Services, chunk-level deduplication, or
a general document-management storage model.

## Architecture Constraint

Blob bytes are infrastructure state. Engine semantics remain attached to
`AssetRepresentation`, `AssetVersion`, policy decisions, audit records,
source references, and lineage. CMIS content streaming must delegate through
engine-native content services instead of bypassing governance.

## D13.1 - Define blob storage port and blob reference model

```task
id: KONT-WP-0013-T001
status: todo
priority: high
state_hub_task_id: "6bb5b49a-cf9f-47ce-86d3-24b47a20a2c6"
```

Acceptance:

- A stable blob storage port is defined for put, read/open, stat, exists, and
  delete-unreferenced operations.
- Blob references include digest, size, media type where appropriate, storage
  key, and adapter name.
- Port contracts support deterministic tests without requiring filesystem or
  external object storage.

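The blob reference described in the acceptance criteria could be modeled as an immutable value object. This is a sketch under assumptions: the class name `BlobRef`, its field names, and the validation rules are hypothetical; only the set of fields (digest, size, media type, storage key, adapter name) comes from the list above.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Matches the "sha256:<hex>" digest format used elsewhere in this workplan.
_DIGEST_RE = re.compile(r"^sha256:[0-9a-f]{64}$")


@dataclass(frozen=True)
class BlobRef:
    """Hypothetical blob reference; field names are illustrative."""

    digest: str
    size: int
    storage_key: str
    adapter: str
    media_type: Optional[str] = None  # "where appropriate", so optional here

    def __post_init__(self) -> None:
        if not _DIGEST_RE.match(self.digest):
            raise ValueError(f"unexpected digest format: {self.digest!r}")
        if self.size < 0:
            raise ValueError("size must be non-negative")
```

Because the dataclass is frozen and compares by value, two references to the same blob are equal and hashable, which keeps port contract tests deterministic without touching a filesystem.
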
## D13.2 - Implement content-addressed local blob adapter

```task
id: KONT-WP-0013-T002
status: todo
priority: high
state_hub_task_id: "661386c7-8094-4f0f-928c-c17f5b3a9132"
```

Acceptance:

- Local adapter stores blobs by digest-derived path.
- Writes are idempotent and verify digest/size.
- Duplicate content does not duplicate stored bytes.
- Atomic write behavior is covered by tests where practical.

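A local adapter write that meets these criteria might look like the sketch below. The directory layout (`<root>/sha256/<first two hex chars>/<hex>`) and the function names `blob_path` and `put_bytes` are assumptions for illustration. Atomicity relies on writing a temporary file in the target directory and renaming it with `os.replace`.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def blob_path(root: Path, digest: str) -> Path:
    """Map "sha256:<hex>" to <root>/sha256/<hex[:2]>/<hex> (assumed layout)."""
    algo, hex_digest = digest.split(":", 1)
    return root / algo / hex_digest[:2] / hex_digest


def put_bytes(root: Path, data: bytes) -> Path:
    digest = "sha256:" + hashlib.sha256(data).hexdigest()
    target = blob_path(root, digest)
    if target.exists():
        return target  # idempotent: duplicate content reuses the stored blob
    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory, then rename into place so
    # readers never observe a partially written blob.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, target)
    except BaseException:
        # Remove the temp file if the write or rename failed, then re-raise.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return target
```

Keeping the temporary file in the same directory as the target keeps the rename on one filesystem, which is what makes `os.replace` an atomic swap on POSIX systems.
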
## D13.3 - Implement representation content service

```task
id: KONT-WP-0013-T003
status: todo
priority: high
state_hub_task_id: "00bc34c5-0f79-47b6-b305-f47311edd3a7"
```

Acceptance:

- Service creates `AssetRepresentation` records from bytes through blob storage.
- Existing asset service mutation paths can delegate to the content service.
- Content changes create versions and audit events.
- Existing opaque `storage_ref` behavior remains compatible.

## D13.4 - Add blob reference accounting and safe cleanup

```task
id: KONT-WP-0013-T004
status: todo
priority: medium
state_hub_task_id: "cc4445d9-f773-4337-afd4-aeccc743dc1e"
```

Acceptance:

- Referenced blob discovery is deterministic from representation records.
- Cleanup can identify unreferenced blobs without deleting active content.
- Dry-run cleanup reports reclaimable bytes and references.

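A dry-run cleanup pass can be expressed as a pure function over two inputs, which makes it easy to test. The sketch assumes the store can list blob sizes by digest and that representation records yield the digests they reference; `plan_cleanup` and both input shapes are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def plan_cleanup(
    stored: Dict[str, int], referenced: Iterable[str]
) -> Tuple[List[str], int]:
    """Return (unreferenced digests, reclaimable bytes) without deleting anything.

    `stored` maps digest -> size for blobs in the store; `referenced` lists the
    digests found on representation records. Both shapes are assumptions.
    """
    live = set(referenced)
    # Sorting keeps the report deterministic across runs.
    orphans = sorted(d for d in stored if d not in live)
    return orphans, sum(stored[d] for d in orphans)


stored = {"sha256:aa": 100, "sha256:bb": 250, "sha256:cc": 40}
referenced = ["sha256:aa", "sha256:aa", "sha256:cc"]  # duplicate references are fine
orphans, reclaimable = plan_cleanup(stored, referenced)
```

Here only `sha256:bb` is unreferenced, so the report lists one digest and 250 reclaimable bytes. An actual delete step would consume this plan rather than recompute it, so that nothing outside the reviewed report is removed.
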
## D13.5 - Expose engine-native content stream interfaces

```task
id: KONT-WP-0013-T005
status: todo
priority: high
state_hub_task_id: "db0e8a2d-50ce-439c-8393-d65e2fc4bc9e"
```

Acceptance:

- Service/runtime methods can fetch representation bytes or stream handles.
- Response metadata includes digest, size, media type, and representation ID.
- Policy checks happen before bytes are exposed.
- Tests cover source, normalized, and derived representation reads.

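The ordering requirement (policy first, bytes second) can be illustrated with a small function that takes its collaborators as callables. Everything here is a sketch under assumptions: `ContentStream`, the dict-shaped representation record, `is_allowed`, and `open_blob` are hypothetical stand-ins for the engine's real types and services.

```python
import io
from dataclasses import dataclass
from typing import Any, BinaryIO, Callable, Dict


@dataclass(frozen=True)
class ContentStream:
    """Stream handle plus the response metadata named in the criteria above."""

    representation_id: str
    digest: str
    size: int
    media_type: str
    stream: BinaryIO


def get_content_stream(
    representation: Dict[str, Any],
    principal: str,
    is_allowed: Callable[[str, str], bool],
    open_blob: Callable[[str], BinaryIO],
) -> ContentStream:
    rep_id = str(representation["id"])
    # Policy is evaluated first; the blob is never opened for a denied caller.
    if not is_allowed(principal, rep_id):
        raise PermissionError(f"{principal} may not read {rep_id}")
    digest = str(representation["digest"])
    return ContentStream(
        representation_id=rep_id,
        digest=digest,
        size=int(representation["size"]),
        media_type=str(representation["media_type"]),
        stream=open_blob(digest),
    )
```

Passing `open_blob` as a callable lets a test record whether storage was touched, so a policy-denial test can assert that no blob was opened.
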
## D13.6 - Integrate CMIS content stream byte semantics

```task
id: KONT-WP-0013-T006
status: todo
priority: high
state_hub_task_id: "2f1da1fb-9634-4ba6-931a-3e29394efd37"
```

Acceptance:

- CMIS `getContentStream` can return actual bytes/stream semantics when content
  is locally available.
- CMIS `setContentStream` writes through the deduplicating content service.
- CMIS descriptors remain available for metadata-only clients.
- Unsupported external/opaque storage refs produce structured diagnostics.

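The split between locally available content and opaque references could be handled by a small decision function in front of the CMIS route. This is only a sketch: the `blob:` prefix used to recognize local blobs, the diagnostic code, and the function name `describe_content_stream` are all assumptions, because the workplan does not define a `storage_ref` scheme or a diagnostic format.

```python
from typing import Any, Dict

LOCAL_PREFIX = "blob:"  # assumed marker for locally stored blobs; not defined by the workplan


def describe_content_stream(rep: Dict[str, Any]) -> Dict[str, Any]:
    """Return response headers for local content, or a structured diagnostic."""
    ref = str(rep["storage_ref"])
    if not ref.startswith(LOCAL_PREFIX):
        # External or opaque reference: report why bytes cannot be served.
        return {
            "ok": False,
            "diagnostic": {
                "code": "content_not_locally_available",  # hypothetical code
                "representation_id": rep["id"],
                "storage_ref": ref,
            },
        }
    return {
        "ok": True,
        "headers": {
            "Content-Type": rep["media_type"],
            "Content-Length": str(rep["size"]),
        },
    }
```

Returning data instead of raising keeps the opaque reference intact in the diagnostic, so metadata-only clients and operators can still see where the content lives.
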
## D13.7 - Document deployment, migration, and capacity posture

```task
id: KONT-WP-0013-T007
status: todo
priority: medium
state_hub_task_id: "987ad4f6-8658-4e93-82c2-b9fa0a3a2270"
```

Acceptance:

- Blob root configuration and local adapter behavior are documented.
- Existing representations with opaque `storage_ref` values have a migration
  posture.
- Capacity tests demonstrate dedupe effectiveness and avoid excessive fixture
  size.
- Operational cleanup guidance is documented.

## Definition Of Done

- Blob storage port and local content-addressed adapter exist.
- Representation bytes can be stored, deduplicated, read, and governed.
- CMIS content stream routes can expose real bytes when available.
- Existing tests continue to pass.
- Focused dedupe/content-stream tests cover duplicate content, readback,
  policy denial, cleanup dry-run, and CMIS integration.