# Ingestion Implementation Note
Date: 2026-05-06
Status: first implementation slice for `KONT-WP-0006`.
## Purpose
This note records the first governed ingestion implementation built on top of
the asset registry service. It turns local source files into stable knowledge
assets with source and normalized representations while preserving source
provenance, actor context, auditability, and inspectable ingestion job state.
## Implemented Package Shape
```text
src/kontextual_engine/
  core/ingestion.py
  ports/ingestion.py
  services/ingestion_service.py
  adapters/local_files/
  adapters/builtin_extractors/
  adapters/markitect_tool/
```
The new `AssetIngestionService` is separate from the older artifact-era
`IngestionService` compatibility facade in `src/kontextual_engine/ingestion.py`.
## Implemented Capabilities
- Durable `IngestionJob` state with queued, running, completed, failed,
partially completed, retried, quarantined, and canceled statuses.
- Structured ingestion failures with retriable flags and diagnostic details.
- Connector and extractor port contracts owned by the engine.
- Local file connector with source references, checksums, media type detection,
file metadata, and directory file iteration.
- Plain text extractor producing a normalized engine representation.
- Markitect markdown extractor adapter boundary that delegates markdown parsing,
headings, sections, frontmatter, and snapshot identity to `markitect-tool`
when available.
- Synchronous first-run ingestion flow that creates governed assets through
`AssetRegistryService`.
- Source and normalized `AssetRepresentation` records for ingested files.
- Metadata records for connector, extractor, source digest, source media type,
size, and extraction metadata.
- Unsupported-media ingestion fails the job and records the failure without
adding an asset to the trusted registry.
- Directory ingestion with per-file child jobs and partial result accounting.
- In-memory and SQLite job persistence.
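
As a rough illustration of the job statuses, structured failures, local file description, and directory accounting listed above, here is a minimal sketch. Every name in it (`JobStatus`, `IngestionFailure`, `SourceFile`, `describe_local_file`, `check_supported`, `roll_up`, and the `SUPPORTED_MEDIA_TYPES` set) is hypothetical and chosen for illustration; the engine's actual contracts in `core/ingestion.py` may differ.

```python
import hashlib
import mimetypes
from dataclasses import dataclass, field
from enum import Enum
from pathlib import Path
from typing import List, Optional


class JobStatus(str, Enum):
    # Mirrors the statuses named in this note; the string values are assumptions.
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    PARTIALLY_COMPLETED = "partially_completed"
    RETRIED = "retried"
    QUARANTINED = "quarantined"
    CANCELED = "canceled"


@dataclass(frozen=True)
class IngestionFailure:
    # Hypothetical structured failure: a code, a retriable flag, and diagnostics.
    code: str
    message: str
    retriable: bool = False
    details: dict = field(default_factory=dict)


@dataclass(frozen=True)
class SourceFile:
    # Hypothetical source reference for one local file.
    uri: str
    sha256: str
    media_type: Optional[str]
    size_bytes: int


# Assumption: only these media types have an extractor in this sketch.
SUPPORTED_MEDIA_TYPES = {"text/plain", "text/markdown"}


def describe_local_file(path: Path) -> SourceFile:
    """Read a file once and record its reference, checksum, type, and size."""
    data = path.read_bytes()
    media_type, _encoding = mimetypes.guess_type(path.name)
    return SourceFile(
        uri=path.resolve().as_uri(),
        sha256=hashlib.sha256(data).hexdigest(),
        media_type=media_type,
        size_bytes=len(data),
    )


def check_supported(source: SourceFile) -> Optional[IngestionFailure]:
    """Return a non-retriable failure when no extractor handles the media type."""
    if source.media_type in SUPPORTED_MEDIA_TYPES:
        return None
    return IngestionFailure(
        code="unsupported_media_type",
        message=f"no extractor for {source.media_type!r}",
        retriable=False,
        details={"uri": source.uri, "media_type": source.media_type},
    )


def roll_up(child_statuses: List[JobStatus]) -> JobStatus:
    """Derive a directory job status from its per-file child jobs."""
    if not child_statuses:
        # Assumption: an empty directory is not treated as a failure.
        return JobStatus.COMPLETED
    if all(s is JobStatus.COMPLETED for s in child_statuses):
        return JobStatus.COMPLETED
    if any(s is JobStatus.COMPLETED for s in child_statuses):
        return JobStatus.PARTIALLY_COMPLETED
    return JobStatus.FAILED
```

In this sketch a directory with one ingested text file and one unsupported binary would roll up to `PARTIALLY_COMPLETED`, and the unsupported file would produce an `IngestionFailure` instead of an asset.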
## Current SQLite Additions
- `ingestion_jobs`

The table stores indexed status, actor, correlation ID, and timestamp columns,
plus a JSON payload holding the full job contract.
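
One way such a table could look is sketched below with the standard-library `sqlite3` module. Only the table name `ingestion_jobs` comes from this note; the column names, the choice of indexes, and the `save_job` and `load_jobs_by_status` helpers are assumptions made for illustration.

```python
import json
import sqlite3

# Hypothetical schema: queryable columns next to a JSON payload column.
SCHEMA = """
CREATE TABLE IF NOT EXISTS ingestion_jobs (
    job_id         TEXT PRIMARY KEY,
    status         TEXT NOT NULL,
    actor          TEXT NOT NULL,
    correlation_id TEXT,
    created_at     TEXT NOT NULL,
    updated_at     TEXT NOT NULL,
    payload        TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_ingestion_jobs_status
    ON ingestion_jobs (status);
CREATE INDEX IF NOT EXISTS idx_ingestion_jobs_actor
    ON ingestion_jobs (actor);
CREATE INDEX IF NOT EXISTS idx_ingestion_jobs_correlation
    ON ingestion_jobs (correlation_id);
"""


def save_job(conn: sqlite3.Connection, job: dict) -> None:
    """Insert or overwrite a job row; the full job dict goes into the payload."""
    conn.execute(
        "INSERT OR REPLACE INTO ingestion_jobs "
        "(job_id, status, actor, correlation_id, created_at, updated_at, payload) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (
            job["job_id"],
            job["status"],
            job["actor"],
            job.get("correlation_id"),
            job["created_at"],
            job["updated_at"],
            json.dumps(job),
        ),
    )
    conn.commit()


def load_jobs_by_status(conn: sqlite3.Connection, status: str) -> list:
    """Filter on the indexed status column, then rebuild jobs from the payload."""
    rows = conn.execute(
        "SELECT payload FROM ingestion_jobs WHERE status = ? ORDER BY created_at",
        (status,),
    ).fetchall()
    return [json.loads(payload) for (payload,) in rows]
```

Keeping the full contract in the payload means new job fields need no schema migration, while the duplicated columns keep status and actor lookups cheap.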
## markitect-tool Boundary
Markdown ingestion uses `MarkitectMarkdownExtractor`, which imports
`markitect-tool` only inside the adapter. The engine preserves Markitect output
as normalized structure and adapter metadata; it does not make Markitect
document classes part of the engine domain model.
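
The boundary described above can be sketched roughly as follows. This is not the real `MarkitectMarkdownExtractor`: the class names, the importable module name `markitect_tool`, and the injected `parse` callable are all assumptions, and no actual `markitect-tool` function is called because its API is not described in this note.

```python
import importlib
from dataclasses import dataclass, field
from types import ModuleType
from typing import Any, Callable, Dict, Mapping


class ExtractorUnavailable(RuntimeError):
    """Raised when the optional parsing dependency cannot be imported."""


@dataclass(frozen=True)
class NormalizedDocument:
    # Engine-owned shape: plain text and plain dicts, no third-party classes.
    text: str
    structure: Dict[str, Any] = field(default_factory=dict)
    adapter_metadata: Dict[str, Any] = field(default_factory=dict)


class LazyMarkdownExtractor:
    """Hypothetical adapter that imports its parser only when it is used."""

    def __init__(
        self,
        parse: Callable[[ModuleType, str], Mapping[str, Any]],
        module_name: str = "markitect_tool",  # assumed import name
    ) -> None:
        # `parse` is a hypothetical hook that calls into the tool and returns
        # plain data (headings, sections, frontmatter, snapshot identity).
        self._parse = parse
        self._module_name = module_name

    def _load(self) -> ModuleType:
        # The import lives here so the engine never imports the tool at
        # module import time.
        try:
            return importlib.import_module(self._module_name)
        except ImportError as exc:
            raise ExtractorUnavailable(
                f"{self._module_name} is not installed"
            ) from exc

    def extract(self, text: str) -> NormalizedDocument:
        module = self._load()
        parsed = self._parse(module, text)
        # Copy into a plain dict so tool-specific objects stay in the adapter.
        return NormalizedDocument(
            text=text,
            structure=dict(parsed),
            adapter_metadata={"adapter": self._module_name},
        )
```

Callers that catch `ExtractorUnavailable` can turn it into a structured ingestion failure, so a missing optional dependency never surfaces as an import error in the engine core.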
## Not Yet Implemented
- Asynchronous job runner and queue dispatch.
- Re-ingestion reconciliation for existing assets.
- Identity policies that preserve asset identity across source moves.
- PDF, office document, and dataset extractors.
- Deep normalized structure for tables, links, embedded references, and fields
beyond extractor-provided metadata.
- Quarantine policy checks beyond unsupported/failed extraction paths.
## Test Coverage
`tests/test_asset_ingestion_service.py` covers:
- plain text local-file ingestion into governed assets,
- source and normalized representation creation,
- job persistence and status inspection,
- unsupported media failure without trusted asset creation,
- directory partial success/failure accounting,
- SQLite reload preserving ingestion jobs and ingested asset state.