# Ingestion Implementation Note

Date: 2026-05-06

Status: first implementation slice for `KONT-WP-0006`.

## Purpose

This note records the first governed ingestion implementation built on top of
the asset registry service. It turns local source files into stable knowledge
assets with source and normalized representations while preserving source
provenance, actor context, auditability, and inspectable ingestion job state.

## Implemented Package Shape

```text
src/kontextual_engine/
  core/ingestion.py
  ports/ingestion.py
  services/ingestion_service.py
  adapters/local_files/
  adapters/builtin_extractors/
  adapters/markitect_tool/
```

The new `AssetIngestionService` is separate from the older artifact-era
`IngestionService` compatibility facade in `src/kontextual_engine/ingestion.py`.

## Implemented Capabilities

- Durable `IngestionJob` state with queued, running, completed, failed,
  partially completed, retried, quarantined, and canceled statuses.
- Structured ingestion failures with retriable flags and diagnostic details.
- Connector and extractor port contracts owned by the engine.
- Local file connector with source references, checksums, media type detection,
  file metadata, and directory file iteration.
- Explicit ingestion identity policy with conservative source-location identity
  by default and opt-in content-digest identity for governed file move/rename
  reconciliation.
- Plain text extractor producing a normalized engine representation.
- Plain text structural output includes lines, paragraphs, link extraction,
  confidence, and extractor metadata.
- CSV/TSV dataset extractor producing structured normalized table output with
  columns, row counts, and table metadata.
- CSV/TSV structural output includes table schemas, sample rows, link
  extraction from cell values, confidence, and extractor metadata.
- PDF and office document placeholder extractor that represents binary
  documents as governed assets while reporting metadata-only extraction depth.
- PDF and office placeholder output exposes unsupported elements, unsupported
  counts, confidence, and extractor metadata.
- Markitect markdown extractor adapter boundary that delegates markdown parsing,
  headings, sections, frontmatter, and snapshot identity to `markitect-tool`
  when available.
- Markdown normalized structure preserves Markitect-provided blocks, headings,
  sections, snapshot metadata, table blocks, and link tokens where available.
- Missing `markitect-tool` dependency fails through structured
  `AdapterUnavailableError` diagnostics instead of falling back to local
  Markdown parsing.
- Synchronous first-run ingestion flow that creates governed assets through
  `AssetRegistryService`.
- Source and normalized `AssetRepresentation` records for ingested files.
- Metadata records for connector, extractor, source digest, source media type,
  size, and extraction metadata.
- Failed unsupported-media ingestion records job failure without adding an asset
  to the trusted registry.
- Directory ingestion with per-file child jobs and partial result accounting.
- Directory item results distinguish succeeded, skipped, failed, quarantined,
  and retriable failure state.
- Re-ingestion can update an existing asset with new source references and
  source/normalized representations instead of creating a second asset.
- Unchanged source re-ingestion can be skipped without creating a new asset
  version.
- Ingestion validation runs after extraction and before registry writes.
- Invalid normalized output, missing source provenance, checksum mismatch, low
  confidence without explicit unsupported elements, missing extractor
  provenance, or denied source permission context quarantine the job without
  creating a trusted asset.
- Source permission context is preserved on source/normalized representation
  metadata and as a metadata record when present.
- In-memory and SQLite job persistence.
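
The identity policy listed above can be illustrated with a small sketch. This is not the engine's code: `IdentityPolicy`, `identity_key`, the key prefixes, and the choice of SHA-256 are all assumptions made for illustration. It only shows why a content-digest key survives a file move while a source-location key does not.

```python
import hashlib
from enum import Enum


class IdentityPolicy(Enum):
    # Hypothetical names; the engine's real policy type may differ.
    SOURCE_LOCATION = "source_location"
    CONTENT_DIGEST = "content_digest"


def identity_key(
    source_uri: str,
    content: bytes,
    policy: IdentityPolicy = IdentityPolicy.SOURCE_LOCATION,
) -> str:
    """Return the key used to match an incoming file to an existing asset."""
    if policy is IdentityPolicy.CONTENT_DIGEST:
        # Opt-in: identical bytes map to the same asset wherever the file lives.
        # SHA-256 is an assumed digest, not confirmed by this note.
        return "digest:" + hashlib.sha256(content).hexdigest()
    # Conservative default: a new location is treated as a new asset.
    return "location:" + source_uri


old_uri = "file:///docs/report.txt"
new_uri = "file:///archive/report.txt"
body = b"quarterly numbers"

# Under the default policy a moved file gets a different key.
moved_by_location = identity_key(old_uri, body) == identity_key(new_uri, body)

# Under the opt-in digest policy the moved file keeps its key.
moved_by_digest = identity_key(
    old_uri, body, IdentityPolicy.CONTENT_DIGEST
) == identity_key(new_uri, body, IdentityPolicy.CONTENT_DIGEST)

print(moved_by_location, moved_by_digest)
```

Keeping source location as the default avoids silently merging two distinct files that happen to share content; the digest policy trades that safety for move/rename reconciliation, which is presumably why it is opt-in.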

## Current SQLite Additions

- `ingestion_jobs`

The table stores indexed status, actor, correlation ID, timestamps, and a JSON
payload for the full job contract.
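
A minimal sketch of what such a table could look like, using Python's standard `sqlite3` module. Only the table name `ingestion_jobs` comes from this note; the column names, index names, and the `save_job`/`load_job` helpers are hypothetical, and the real schema may differ.

```python
import json
import sqlite3
from typing import Optional

# Column and index names are illustrative; only the table name is from the note.
SCHEMA = """
CREATE TABLE IF NOT EXISTS ingestion_jobs (
    job_id         TEXT PRIMARY KEY,
    status         TEXT NOT NULL,
    actor          TEXT NOT NULL,
    correlation_id TEXT,
    created_at     TEXT NOT NULL,
    updated_at     TEXT NOT NULL,
    payload        TEXT NOT NULL  -- full job contract as JSON
);
CREATE INDEX IF NOT EXISTS idx_ingestion_jobs_status ON ingestion_jobs (status);
CREATE INDEX IF NOT EXISTS idx_ingestion_jobs_actor ON ingestion_jobs (actor);
CREATE INDEX IF NOT EXISTS idx_ingestion_jobs_correlation
    ON ingestion_jobs (correlation_id);
"""


def save_job(conn: sqlite3.Connection, job: dict) -> None:
    """Upsert a job: queryable fields go to columns, the whole job to JSON."""
    conn.execute(
        "INSERT OR REPLACE INTO ingestion_jobs VALUES (?, ?, ?, ?, ?, ?, ?)",
        (
            job["job_id"],
            job["status"],
            job["actor"],
            job.get("correlation_id"),
            job["created_at"],
            job["updated_at"],
            json.dumps(job),
        ),
    )


def load_job(conn: sqlite3.Connection, job_id: str) -> Optional[dict]:
    """Rebuild the job from its JSON payload, or return None if unknown."""
    row = conn.execute(
        "SELECT payload FROM ingestion_jobs WHERE job_id = ?", (job_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

job = {
    "job_id": "job-1",
    "status": "quarantined",
    "actor": "alice",
    "correlation_id": "corr-42",
    "created_at": "2026-05-06T10:00:00Z",
    "updated_at": "2026-05-06T10:00:05Z",
    "failure": {"retriable": False, "reason": "checksum mismatch"},
}
save_job(conn, job)

# Status is a plain column, so filtering does not need to parse the payload.
quarantined_ids = [
    row[0]
    for row in conn.execute(
        "SELECT job_id FROM ingestion_jobs WHERE status = ?", ("quarantined",)
    )
]
```

Splitting the row this way keeps common filters (status, actor, correlation ID) cheap while letting the job contract evolve inside the JSON payload without a schema migration for every new field.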

## markitect-tool Boundary

Markdown ingestion uses `MarkitectMarkdownExtractor`, which imports
`markitect-tool` only inside the adapter. The engine preserves Markitect output
as normalized structure and adapter metadata; it does not make Markitect
document classes part of the engine domain model.
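
The adapter-local import pattern described here might look roughly like the sketch below. It is an illustration under assumptions, not the actual `MarkitectMarkdownExtractor`: the class name `LazyMarkdownExtractor`, the module name `markitect_tool`, the `parse` call, and the fields on this stand-in `AdapterUnavailableError` are all hypothetical.

```python
import importlib


class AdapterUnavailableError(RuntimeError):
    """Stand-in for the engine's structured error; its fields are assumed."""

    def __init__(self, adapter: str, dependency: str) -> None:
        super().__init__(f"{adapter} requires optional dependency {dependency!r}")
        self.adapter = adapter
        self.dependency = dependency


class LazyMarkdownExtractor:
    """Imports its backend only when extraction is actually requested."""

    def __init__(self, module_name: str = "markitect_tool") -> None:
        # Assumed import name; the real package may expose a different one.
        self._module_name = module_name

    def _load_backend(self):
        try:
            return importlib.import_module(self._module_name)
        except ImportError as exc:
            # No fallback parser: surface a structured, inspectable failure.
            raise AdapterUnavailableError(
                "markdown-extractor", self._module_name
            ) from exc

    def extract(self, text: str) -> dict:
        backend = self._load_backend()
        # `parse` is a placeholder, not a confirmed markitect-tool function.
        document = backend.parse(text)
        # Return plain data so backend classes never enter the domain model.
        return {"structure": document, "extractor": self._module_name}


# Constructing the extractor never imports the backend, so this always works.
extractor = LazyMarkdownExtractor("backend_module_that_is_not_installed")

try:
    extractor.extract("# Title")
    missing_dependency = None
except AdapterUnavailableError as err:
    missing_dependency = err.dependency

print(missing_dependency)
```

Because the import happens inside the adapter method, the rest of the engine can be imported and tested without the optional dependency installed, and a missing backend shows up as a diagnosable job failure rather than an import-time crash.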

## Not Yet Implemented

- Asynchronous job runner and queue dispatch.
- Deep normalized structure for tables, links, embedded references, and fields
  beyond extractor-provided metadata and current text/Markdown/CSV baselines.
- Optional deep PDF and office document extraction adapters.
- Enterprise policy adapter integration for ingestion-time policy decisions.

## Test Coverage

`tests/test_asset_ingestion_service.py` covers:

- plain text local-file ingestion into governed assets,
- source and normalized representation creation,
- job persistence and status inspection,
- unsupported media failure without trusted asset creation,
- directory partial success/failure accounting,
- directory skipped item and retriable failure reporting,
- content-digest identity preserving asset identity across file moves,
- unchanged source re-ingestion skip behavior,
- Markitect markdown adapter delegation and missing-dependency behavior,
- CSV dataset structured normalization,
- normalized structure coverage for text, Markdown, CSV, and document
  placeholders,
- PDF and office placeholder ingestion with explicit unsupported-depth
  diagnostics,
- ingestion validation/quarantine before registry writes,
- permission-context preservation on trusted ingested assets,
- directory reporting for succeeded, skipped, failed, quarantined, and
  retriable items,
- optional Markitect integration contract tests for parser, selector,
  operation, snapshot, context package, contract, and schema behavior,
- SQLite reload preserving ingestion jobs and ingested asset state.