# Ingestion Implementation Note

Date: 2026-05-06

Status: first implementation slice for `KONT-WP-0006`.

## Purpose

This note records the first governed ingestion implementation built on top of
the asset registry service. It turns local source files into stable knowledge
assets with source and normalized representations while preserving source
provenance, actor context, auditability, and inspectable ingestion job state.

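
To make "source provenance" concrete, here is a minimal sketch of how a local file could be described before ingestion. `SourceReference` and `describe_local_file` are hypothetical names chosen for illustration, not the engine's actual contract; the sketch only assumes a SHA-256 digest, a media type guessed from the file name, and the byte size.

```python
import hashlib
import mimetypes
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class SourceReference:
    """Hypothetical provenance record for one local source file."""

    uri: str
    sha256: str
    media_type: str
    size_bytes: int


def describe_local_file(path: Path) -> SourceReference:
    """Read a file once and capture the facts needed to trace it later."""
    data = path.read_bytes()
    # guess_type works from the file name only; fall back to a generic type.
    guessed, _encoding = mimetypes.guess_type(path.name)
    return SourceReference(
        uri=path.resolve().as_uri(),
        sha256=hashlib.sha256(data).hexdigest(),
        media_type=guessed or "application/octet-stream",
        size_bytes=len(data),
    )
```

Recording the digest alongside the URI lets a later run tell whether the same path still holds the same bytes.
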
## Implemented Package Shape

```text
src/kontextual_engine/
  core/ingestion.py
  ports/ingestion.py
  services/ingestion_service.py
  adapters/local_files/
  adapters/builtin_extractors/
  adapters/markitect_tool/
```

The new `AssetIngestionService` is separate from the older artifact-era
`IngestionService` compatibility facade in `src/kontextual_engine/ingestion.py`.

## Implemented Capabilities

- Durable `IngestionJob` state with queued, running, completed, failed,
  partially completed, retried, quarantined, and canceled statuses.
- Structured ingestion failures with retriable flags and diagnostic details.
- Connector and extractor port contracts owned by the engine.
- Local file connector with source references, checksums, media type detection,
  file metadata, and directory file iteration.
- Plain text extractor producing a normalized engine representation.
- Markitect markdown extractor adapter boundary that delegates markdown parsing,
  headings, sections, frontmatter, and snapshot identity to `markitect-tool`
  when available.
- Synchronous first-run ingestion flow that creates governed assets through
  `AssetRegistryService`.
- Source and normalized `AssetRepresentation` records for ingested files.
- Metadata records for connector, extractor, source digest, source media type,
  size, and extraction metadata.
- Unsupported-media failure handling that records the job failure without
  adding an asset to the trusted registry.
- Directory ingestion with per-file child jobs and partial result accounting.
- In-memory and SQLite job persistence.

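
The job-state, structured-failure, and directory-accounting items above can be sketched roughly as follows. This is an illustration under stated assumptions, not the engine's code: the field names, the `unsupported_media` failure code, the `SUPPORTED_SUFFIXES` set, and the rule used to derive the parent status are all hypothetical. Only the status names and the `IngestionJob` class name come from this note.

```python
from dataclasses import dataclass, field
from enum import Enum
from pathlib import Path
from typing import Optional


class JobStatus(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    PARTIALLY_COMPLETED = "partially_completed"
    RETRIED = "retried"
    QUARANTINED = "quarantined"
    CANCELED = "canceled"


@dataclass
class IngestionFailure:
    """Structured failure: machine-readable code plus diagnostics."""

    code: str
    message: str
    retriable: bool = False
    details: dict = field(default_factory=dict)


@dataclass
class IngestionJob:
    source: str
    status: JobStatus = JobStatus.QUEUED
    failure: Optional[IngestionFailure] = None
    children: list["IngestionJob"] = field(default_factory=list)


SUPPORTED_SUFFIXES = {".txt", ".md"}  # assumption made for this sketch only


def ingest_file(path: Path) -> IngestionJob:
    job = IngestionJob(source=str(path), status=JobStatus.RUNNING)
    if path.suffix.lower() not in SUPPORTED_SUFFIXES:
        # Record the failure on the job; no asset is created on this path.
        job.status = JobStatus.FAILED
        job.failure = IngestionFailure(
            code="unsupported_media",
            message=f"no extractor for suffix {path.suffix!r}",
            retriable=False,
            details={"suffix": path.suffix},
        )
        return job
    # A real service would register the governed asset here.
    job.status = JobStatus.COMPLETED
    return job


def ingest_directory(root: Path) -> IngestionJob:
    """Run one child job per file and derive the parent status from them."""
    parent = IngestionJob(source=str(root), status=JobStatus.RUNNING)
    for path in sorted(p for p in root.iterdir() if p.is_file()):
        parent.children.append(ingest_file(path))
    done = sum(c.status is JobStatus.COMPLETED for c in parent.children)
    if done == len(parent.children):
        parent.status = JobStatus.COMPLETED  # also covers an empty directory
    elif done == 0:
        parent.status = JobStatus.FAILED
    else:
        parent.status = JobStatus.PARTIALLY_COMPLETED
    return parent
```

Keeping each file in its own child job means one unsupported file does not hide the files that did ingest, and the parent status stays a pure function of its children.
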
## Current SQLite Additions

- `ingestion_jobs`

The table stores indexed status, actor, correlation ID, timestamps, and a JSON
payload for the full job contract.

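
One plausible shape for such a table, using Python's standard `sqlite3` module, is sketched below. Only the table name `ingestion_jobs` comes from this note; the column names, index names, and the helpers `save_job` and `load_jobs_by_status` are assumptions for illustration and may differ from the real schema.

```python
import json
import sqlite3

# Hypothetical schema: lookup columns are indexed, the full job is kept as JSON.
SCHEMA = """
CREATE TABLE IF NOT EXISTS ingestion_jobs (
    job_id         TEXT PRIMARY KEY,
    status         TEXT NOT NULL,
    actor          TEXT NOT NULL,
    correlation_id TEXT,
    created_at     TEXT NOT NULL,
    updated_at     TEXT NOT NULL,
    payload        TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_ingestion_jobs_status
    ON ingestion_jobs (status);
CREATE INDEX IF NOT EXISTS idx_ingestion_jobs_actor
    ON ingestion_jobs (actor);
CREATE INDEX IF NOT EXISTS idx_ingestion_jobs_correlation
    ON ingestion_jobs (correlation_id);
"""


def save_job(conn: sqlite3.Connection, job: dict) -> None:
    """Upsert a job: indexed columns for queries, JSON for the full contract."""
    conn.execute(
        "INSERT OR REPLACE INTO ingestion_jobs VALUES (?, ?, ?, ?, ?, ?, ?)",
        (
            job["job_id"],
            job["status"],
            job["actor"],
            job.get("correlation_id"),
            job["created_at"],
            job["updated_at"],
            json.dumps(job),
        ),
    )
    conn.commit()


def load_jobs_by_status(conn: sqlite3.Connection, status: str) -> list[dict]:
    """Return the decoded JSON payload of every job in the given status."""
    rows = conn.execute(
        "SELECT payload FROM ingestion_jobs WHERE status = ? ORDER BY created_at",
        (status,),
    )
    return [json.loads(row[0]) for row in rows]
```

Duplicating a few fields as columns keeps status and actor queries cheap, while the JSON payload lets the job contract evolve without a schema migration for every new field.
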
## markitect-tool Boundary

Markdown ingestion uses `MarkitectMarkdownExtractor`, which imports
`markitect-tool` only inside the adapter. The engine preserves Markitect output
as normalized structure and adapter metadata; it does not make Markitect
document classes part of the engine domain model.

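
The "import only inside the adapter" pattern can be illustrated with a small sketch. `LazyMarkdownExtractor` and `ExtractorUnavailable` are hypothetical names, and the default import name `markitect_tool` is an assumption; no `markitect-tool` API is called here because this note does not describe it.

```python
import importlib
from typing import Any


class ExtractorUnavailable(RuntimeError):
    """Raised when the optional parsing dependency cannot be imported."""


class LazyMarkdownExtractor:
    """Hypothetical adapter that defers the optional import until first use."""

    def __init__(self, module_name: str = "markitect_tool") -> None:
        # The import name is an assumption; the real package may differ.
        self._module_name = module_name
        self._module: Any = None

    def _load(self) -> Any:
        # Importing here keeps the dependency out of engine import time.
        if self._module is None:
            try:
                self._module = importlib.import_module(self._module_name)
            except ImportError as exc:
                raise ExtractorUnavailable(
                    f"optional dependency {self._module_name!r} is not installed"
                ) from exc
        return self._module

    def available(self) -> bool:
        """Report whether the dependency can be imported, without raising."""
        try:
            self._load()
        except ExtractorUnavailable:
            return False
        return True
```

With this shape, the engine can import the adapter module unconditionally and decide at ingestion time whether markdown extraction is offered or reported as a structured failure.
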
## Not Yet Implemented

- Asynchronous job runner and queue dispatch.
- Re-ingestion reconciliation for existing assets.
- Identity policies that preserve asset identity across source moves.
- PDF, office document, and dataset extractors.
- Deep normalized structure for tables, links, embedded references, and fields
  beyond extractor-provided metadata.
- Quarantine policy checks beyond unsupported/failed extraction paths.

## Test Coverage

`tests/test_asset_ingestion_service.py` covers:

- plain text local-file ingestion into governed assets,
- source and normalized representation creation,
- job persistence and status inspection,
- unsupported media failure without trusted asset creation,
- directory partial success/failure accounting,
- SQLite reload preserving ingestion jobs and ingested asset state.