generated from coulomb/repo-seed
first ingestion/normalization slice
docs/ingestion-implementation.md
@@ -0,0 +1,84 @@
# Ingestion Implementation Note

Date: 2026-05-06

Status: first implementation slice for `KONT-WP-0006`.

## Purpose

This note records the first governed ingestion implementation built on top of
the asset registry service. It turns local source files into stable knowledge
assets with source and normalized representations while preserving source
provenance, actor context, auditability, and inspectable ingestion job state.

## Implemented Package Shape

```text
src/kontextual_engine/
  core/ingestion.py
  ports/ingestion.py
  services/ingestion_service.py
  adapters/local_files/
  adapters/builtin_extractors/
  adapters/markitect_tool/
```

The new `AssetIngestionService` is separate from the older artifact-era
`IngestionService` compatibility facade in `src/kontextual_engine/ingestion.py`.

## Implemented Capabilities

- Durable `IngestionJob` state with queued, running, completed, failed,
  partially completed, retried, quarantined, and canceled statuses.
- Structured ingestion failures with retriable flags and diagnostic details.
- Connector and extractor port contracts owned by the engine.
- Local file connector with source references, checksums, media type detection,
  file metadata, and directory file iteration.
- Plain text extractor producing a normalized engine representation.
- Markitect markdown extractor adapter boundary that delegates markdown parsing,
  headings, sections, frontmatter, and snapshot identity to `markitect-tool`
  when available.
- Synchronous first-run ingestion flow that creates governed assets through
  `AssetRegistryService`.
- Source and normalized `AssetRepresentation` records for ingested files.
- Metadata records for connector, extractor, source digest, source media type,
  size, and extraction metadata.
- Unsupported-media ingestion fails the job and records the failure without
  adding an asset to the trusted registry.
- Directory ingestion with per-file child jobs and partial result accounting.
- In-memory and SQLite job persistence.

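The job statuses, structured failures, and per-file child jobs listed above can be pictured with a minimal sketch. The class fields, the status string values, and the `roll_up_directory_status` helper are assumptions made for illustration; they are not the definitions in `core/ingestion.py`, and the roll-up rule (all completed, none completed, or a mix) is one plausible reading of "partial result accounting".

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class JobStatus(str, Enum):
    """Statuses named in this note; the string values are assumptions."""

    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    PARTIALLY_COMPLETED = "partially_completed"
    RETRIED = "retried"
    QUARANTINED = "quarantined"
    CANCELED = "canceled"


@dataclass(frozen=True)
class IngestionFailure:
    """Hypothetical structured failure: a code, a message, a retriable flag, details."""

    code: str
    message: str
    retriable: bool = False
    details: dict[str, str] = field(default_factory=dict)


@dataclass
class IngestionJob:
    """Hypothetical job record; a directory job carries one child job per file."""

    job_id: str
    source: str
    status: JobStatus = JobStatus.QUEUED
    failure: IngestionFailure | None = None
    children: list[IngestionJob] = field(default_factory=list)


def roll_up_directory_status(parent: IngestionJob) -> JobStatus:
    """Derive the parent status from its children (illustrative rule only)."""
    children = parent.children
    completed = sum(1 for child in children if child.status is JobStatus.COMPLETED)
    if completed == len(children):
        # Assumption: a directory with no files also counts as completed.
        parent.status = JobStatus.COMPLETED
    elif completed == 0:
        parent.status = JobStatus.FAILED
    else:
        parent.status = JobStatus.PARTIALLY_COMPLETED
    return parent.status
```

Under this rule, a directory job with one completed text file and one failed unsupported file ends up partially completed, while the failed child keeps its own failure record for inspection.
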
## Current SQLite Additions

- `ingestion_jobs`

The table stores indexed status, actor, correlation ID, timestamps, and a JSON
payload for the full job contract.

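As a rough illustration of that layout, the sketch below uses the standard-library `sqlite3` module. Only the table name `ingestion_jobs` comes from this note; the column names, index names, and the `save_job` and `load_jobs_by_status` helpers are assumptions, not the engine's actual schema.

```python
from __future__ import annotations

import json
import sqlite3

# Hypothetical schema: a few indexed lookup columns plus the full job as JSON.
SCHEMA = """
CREATE TABLE IF NOT EXISTS ingestion_jobs (
    job_id         TEXT PRIMARY KEY,
    status         TEXT NOT NULL,
    actor          TEXT NOT NULL,
    correlation_id TEXT,
    created_at     TEXT NOT NULL,
    updated_at     TEXT NOT NULL,
    payload        TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_ingestion_jobs_status
    ON ingestion_jobs (status);
CREATE INDEX IF NOT EXISTS idx_ingestion_jobs_actor
    ON ingestion_jobs (actor);
CREATE INDEX IF NOT EXISTS idx_ingestion_jobs_correlation
    ON ingestion_jobs (correlation_id);
"""


def save_job(conn: sqlite3.Connection, job: dict) -> None:
    """Upsert a job: copy the lookup fields into columns, keep the rest as JSON."""
    conn.execute(
        "INSERT OR REPLACE INTO ingestion_jobs "
        "(job_id, status, actor, correlation_id, created_at, updated_at, payload) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (
            job["job_id"],
            job["status"],
            job["actor"],
            job.get("correlation_id"),
            job["created_at"],
            job["updated_at"],
            json.dumps(job, sort_keys=True),
        ),
    )
    conn.commit()


def load_jobs_by_status(conn: sqlite3.Connection, status: str) -> list[dict]:
    """Filter on the indexed status column and rebuild jobs from the payload."""
    rows = conn.execute(
        "SELECT payload FROM ingestion_jobs WHERE status = ? ORDER BY created_at",
        (status,),
    )
    return [json.loads(row[0]) for row in rows]
```

Keeping the full contract in one JSON column means the job model can gain fields without a schema migration, while the duplicated columns keep status, actor, and correlation lookups indexable.
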
## markitect-tool Boundary

Markdown ingestion uses `MarkitectMarkdownExtractor`, which imports
`markitect-tool` only inside the adapter. The engine preserves Markitect output
as normalized structure and adapter metadata; it does not make Markitect
document classes part of the engine domain model.

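The boundary described above follows a common lazy-import adapter pattern, sketched below in generic form. Everything here is hypothetical: `LazyMarkdownExtractor`, `NormalizedDocument`, `ExtractorUnavailable`, and the default module name `markitect_tool` are illustrative stand-ins, and the `parse` callable takes the place of the real `markitect-tool` calls, which this note does not describe.

```python
from __future__ import annotations

import importlib
from dataclasses import dataclass, field
from types import ModuleType
from typing import Any, Callable


@dataclass
class NormalizedDocument:
    """Engine-side result: plain data only, no third-party document classes."""

    text: str
    structure: dict[str, Any] = field(default_factory=dict)
    adapter_metadata: dict[str, Any] = field(default_factory=dict)


class ExtractorUnavailable(RuntimeError):
    """Raised when the optional parsing library is not installed."""


class LazyMarkdownExtractor:
    """Hypothetical adapter that imports its backing library only on use."""

    def __init__(
        self,
        module_name: str,
        parse: Callable[[ModuleType, str], dict[str, Any]],
    ) -> None:
        # module_name is an assumption; the real import name may differ.
        self._module_name = module_name
        # parse converts library output into plain dictionaries at the boundary.
        self._parse = parse

    def extract(self, markdown_text: str) -> NormalizedDocument:
        try:
            # The import happens here, not at engine import time, so the
            # engine core never depends on the library being installed.
            module = importlib.import_module(self._module_name)
        except ImportError as exc:
            raise ExtractorUnavailable(
                f"optional dependency {self._module_name!r} is not installed"
            ) from exc
        structure = self._parse(module, markdown_text)
        return NormalizedDocument(
            text=markdown_text,
            structure=structure,
            adapter_metadata={"adapter": self._module_name},
        )
```

With this shape, a missing dependency surfaces as an extractor-level error that an ingestion job can record, and the engine's own types stay free of library classes.
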
## Not Yet Implemented

- Asynchronous job runner and queue dispatch.
- Re-ingestion reconciliation for existing assets.
- Identity policies that preserve asset identity across source moves.
- PDF, office document, and dataset extractors.
- Deep normalized structure for tables, links, embedded references, and fields
  beyond extractor-provided metadata.
- Quarantine policy checks beyond unsupported/failed extraction paths.

## Test Coverage

`tests/test_asset_ingestion_service.py` covers:

- plain text local-file ingestion into governed assets,
- source and normalized representation creation,
- job persistence and status inspection,
- unsupported media failure without trusted asset creation,
- directory partial success/failure accounting,
- SQLite reload preserving ingestion jobs and ingested asset state.