# Ingestion Implementation Note
Date: 2026-05-06
Status: first implementation slice for `KONT-WP-0006`.
## Purpose
This note records the first governed ingestion implementation built on top of
the asset registry service. It turns local source files into stable knowledge
assets with source and normalized representations while preserving source
provenance, actor context, auditability, and inspectable ingestion job state.
## Implemented Package Shape
```text
src/kontextual_engine/
  core/ingestion.py
  ports/ingestion.py
  services/ingestion_service.py
  adapters/local_files/
  adapters/builtin_extractors/
  adapters/markitect_tool/
```
The new `AssetIngestionService` is separate from the older artifact-era
`IngestionService` compatibility facade in `src/kontextual_engine/ingestion.py`.
## Implemented Capabilities
- Durable `IngestionJob` state with queued, running, completed, failed,
partially completed, retried, quarantined, and canceled statuses.
- Structured ingestion failures with retriable flags and diagnostic details.
- Connector and extractor port contracts owned by the engine.
- Local file connector with source references, checksums, media type detection,
file metadata, and directory file iteration.
- Explicit ingestion identity policy with conservative source-location identity
by default and opt-in content-digest identity for governed file move/rename
reconciliation.
- Plain text extractor producing a normalized engine representation.
- Plain text structural output includes lines, paragraphs, link extraction,
confidence, and extractor metadata.
- CSV/TSV dataset extractor producing structured normalized table output with
columns, row counts, and table metadata.
- CSV/TSV structural output includes table schemas, sample rows, link
extraction from cell values, confidence, and extractor metadata.
- PDF and office document placeholder extractor that represents binary
documents as governed assets while reporting metadata-only extraction depth.
- PDF and office placeholder output exposes unsupported elements, unsupported
counts, confidence, and extractor metadata.
- Markitect markdown extractor adapter boundary that delegates markdown parsing,
headings, sections, frontmatter, and snapshot identity to `markitect-tool`
when available.
- Markdown normalized structure preserves Markitect-provided blocks, headings,
sections, snapshot metadata, table blocks, and link tokens where available.
- Missing `markitect-tool` dependency fails through structured
`AdapterUnavailableError` diagnostics instead of falling back to local
Markdown parsing.
- Synchronous first-run ingestion flow that creates governed assets through
`AssetRegistryService`.
- Source and normalized `AssetRepresentation` records for ingested files.
- Metadata records for connector, extractor, source digest, source media type,
size, and extraction metadata.
- Failed unsupported-media ingestion records job failure without adding an asset
to the trusted registry.
- Directory ingestion with per-file child jobs and partial result accounting.
- Directory item results distinguish succeeded, skipped, failed, quarantined,
and retriable failure state.
- Re-ingestion can update an existing asset with new source references and
source/normalized representations instead of creating a second asset.
- Unchanged source re-ingestion can be skipped without creating a new asset
version.
- Ingestion validation runs after extraction and before registry writes.
- Any of the following quarantines the job without creating a trusted asset:
invalid normalized output, missing source provenance, checksum mismatch, low
confidence without explicit unsupported elements, missing extractor
provenance, or denied source permission context.
- Source permission context is preserved on source/normalized representation
metadata and as a metadata record when present.
- In-memory and SQLite job persistence.
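
The validation rules listed above can be pictured as a single gate that runs between extraction and registry writes. The following is a minimal sketch, not the engine's actual code: `ExtractionResult`, `validate_extraction`, the field names, the reason strings, and the `0.5` confidence threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical threshold; this note does not state the real value.
LOW_CONFIDENCE = 0.5


@dataclass
class ExtractionResult:
    """Illustrative stand-in for extractor output; not the engine's real type."""
    normalized: Optional[dict]
    source_uri: Optional[str]
    expected_digest: str
    actual_digest: str
    confidence: float
    extractor_name: Optional[str]
    unsupported_elements: list = field(default_factory=list)
    permission_denied: bool = False


def validate_extraction(result: ExtractionResult) -> list:
    """Return quarantine reasons; an empty list means registry writes may proceed."""
    reasons = []
    if not isinstance(result.normalized, dict) or not result.normalized:
        reasons.append("invalid_normalized_output")
    if not result.source_uri:
        reasons.append("missing_source_provenance")
    if result.expected_digest != result.actual_digest:
        reasons.append("checksum_mismatch")
    # Low confidence is acceptable only when the extractor says what it skipped.
    if result.confidence < LOW_CONFIDENCE and not result.unsupported_elements:
        reasons.append("low_confidence_without_unsupported_elements")
    if not result.extractor_name:
        reasons.append("missing_extractor_provenance")
    if result.permission_denied:
        reasons.append("denied_source_permission")
    return reasons
```

Under these assumptions, a PDF placeholder with low confidence still passes the gate because it declares its unsupported elements explicitly, while a plain text result with the same confidence and no declared gaps would be quarantined.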
## Current SQLite Additions
- `ingestion_jobs`

The table stores indexed status, actor, correlation ID, timestamps, and a JSON
payload for the full job contract.
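
As a rough illustration of that layout, the sketch below creates a table with indexed status, actor, and correlation ID columns plus a JSON payload column. The column names, index names, and the `save_job` and `load_job` helpers are assumptions; this note only names the table and the kinds of fields it stores.

```python
import json
import sqlite3

# Assumed schema: only the table name comes from this note.
SCHEMA = """
CREATE TABLE IF NOT EXISTS ingestion_jobs (
    job_id         TEXT PRIMARY KEY,
    status         TEXT NOT NULL,
    actor          TEXT NOT NULL,
    correlation_id TEXT,
    created_at     TEXT NOT NULL,
    updated_at     TEXT NOT NULL,
    payload        TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_ingestion_jobs_status ON ingestion_jobs (status);
CREATE INDEX IF NOT EXISTS idx_ingestion_jobs_actor ON ingestion_jobs (actor);
CREATE INDEX IF NOT EXISTS idx_ingestion_jobs_correlation ON ingestion_jobs (correlation_id);
"""


def save_job(conn: sqlite3.Connection, job: dict) -> None:
    """Upsert a job: queryable fields go in columns, the full contract in JSON."""
    conn.execute(
        "INSERT OR REPLACE INTO ingestion_jobs VALUES (?, ?, ?, ?, ?, ?, ?)",
        (
            job["job_id"],
            job["status"],
            job["actor"],
            job.get("correlation_id"),
            job["created_at"],
            job["updated_at"],
            json.dumps(job),
        ),
    )
    conn.commit()


def load_job(conn: sqlite3.Connection, job_id: str):
    """Rebuild the job from its JSON payload, or return None if it is unknown."""
    row = conn.execute(
        "SELECT payload FROM ingestion_jobs WHERE job_id = ?", (job_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None
```

Keeping the full contract in one JSON column means the job shape can evolve without a migration, while the indexed columns keep status and actor lookups cheap.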
## markitect-tool Boundary
Markdown ingestion uses `MarkitectMarkdownExtractor`, which imports
`markitect-tool` only inside the adapter. The engine preserves Markitect output
as normalized structure and adapter metadata; it does not make Markitect
document classes part of the engine domain model.
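
The boundary described here can be sketched as a deferred import that raises a structured error when the dependency is absent. This is an assumption-heavy illustration: the import name `markitect_tool`, the `load_optional_module` helper, the `MarkdownExtractorSketch` class, and the fields on the error are hypothetical, and the real Markitect parsing API is deliberately not shown because this note does not describe it.

```python
import importlib


class AdapterUnavailableError(RuntimeError):
    """Illustrative stand-in for the engine's structured error of the same name."""

    def __init__(self, adapter: str, missing_module: str):
        super().__init__(f"{adapter} requires optional dependency '{missing_module}'")
        # Hypothetical diagnostic fields.
        self.details = {"adapter": adapter, "missing_module": missing_module}


def load_optional_module(adapter: str, module_name: str):
    """Import an optional dependency at call time, never at engine import time."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # No fallback parser: surface a structured diagnostic instead.
        raise AdapterUnavailableError(adapter, module_name) from exc


class MarkdownExtractorSketch:
    # Assumed import name for the markitect-tool package.
    module_name = "markitect_tool"

    def extract(self, text: str) -> dict:
        parser = load_optional_module(type(self).__name__, self.module_name)
        # A real adapter would call the parser here and map its output to plain
        # dicts, so no parser classes leak into the engine domain model.
        return {
            "adapter": type(self).__name__,
            "parser_module": parser.__name__,
            "source_length": len(text),
        }
```

Because the import happens inside `extract`, the rest of the engine can be imported and tested without the optional package installed.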
## Not Yet Implemented
- Asynchronous job runner and queue dispatch.
- Deep normalized structure for tables, links, embedded references, and fields
beyond extractor-provided metadata and current text/Markdown/CSV baselines.
- Optional deep PDF and office document extraction adapters.
- Enterprise policy adapter integration for ingestion-time policy decisions.
## Test Coverage
`tests/test_asset_ingestion_service.py` covers:
- plain text local-file ingestion into governed assets,
- source and normalized representation creation,
- job persistence and status inspection,
- unsupported media failure without trusted asset creation,
- directory partial success/failure accounting,
- directory skipped item and retriable failure reporting,
- content-digest identity preserving asset identity across file moves,
- unchanged source re-ingestion skip behavior,
- Markitect markdown adapter delegation and missing-dependency behavior,
- CSV dataset structured normalization,
- normalized structure coverage for text, Markdown, CSV, and document
placeholders,
- PDF and office placeholder ingestion with explicit unsupported-depth
diagnostics,
- ingestion validation/quarantine before registry writes,
- permission-context preservation on trusted ingested assets,
- directory reporting for succeeded, skipped, failed, quarantined, and
retriable items,
- optional Markitect integration contract tests for parser, selector,
operation, snapshot, context package, contract, and schema behavior,
- SQLite reload preserving ingestion jobs and ingested asset state.
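
The file-move case in the list above depends on which identity policy is active. A minimal sketch of the difference follows, assuming SHA-256 as the digest and using hypothetical names (`IdentityPolicy`, `identity_key`) that are not taken from the engine.

```python
import hashlib
from enum import Enum


class IdentityPolicy(Enum):
    SOURCE_LOCATION = "source_location"  # conservative default
    CONTENT_DIGEST = "content_digest"    # opt-in, survives move/rename


def identity_key(
    path: str,
    content: bytes,
    policy: IdentityPolicy = IdentityPolicy.SOURCE_LOCATION,
) -> str:
    """Derive the key used to decide whether a file maps to an existing asset."""
    if policy is IdentityPolicy.CONTENT_DIGEST:
        # Same bytes give the same key, regardless of where the file lives.
        return "sha256:" + hashlib.sha256(content).hexdigest()
    # Same location gives the same key, regardless of content changes.
    return "path:" + path
```

With the default policy, moving a file produces a new key and therefore a new asset; with the opt-in digest policy, the moved file resolves to the same key as before.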