---
id: KONT-WP-0006
type: workplan
title: "Multi-Format Ingestion And Normalization"
domain: markitect
repo: kontextual-engine
status: done
owner: codex
topic_slug: markitect
planning_priority: high
planning_order: 6
created: "2026-05-05"
updated: "2026-05-06"
state_hub_workstream_id: "270c83c0-eaed-4143-99d0-bb3fcfd23758"
---

# KONT-WP-0006: Multi-Format Ingestion And Normalization

## Purpose

Implement ingestion as an observable, retryable, provenance-preserving job system that can bring heterogeneous information assets into the engine and normalize them into a common representation for retrieval, metadata, relationships, transformations, workflows, and agent context.

## Requirement Coverage

Primary: FR-020 to FR-030.

Supporting: FR-001 to FR-008, FR-022 to FR-028, FR-200 to FR-202, FR-240 to FR-244.

## Architecture Constraint

Implement ingestion through connector and extractor ports described in `docs/architecture-blueprint.md`. Format-specific parsing, local filesystem access, `markitect-tool`, PDF/document libraries, and dataset readers must live behind adapters, not in the domain core.

## markitect-tool Boundary Remark

Markdown ingestion must use `markitect-tool` for Markdown parsing, frontmatter, headings, sections, selectors, includes, contract checks where needed, and snapshot identity. The engine should normalize Markitect results into its common representation and preserve source/adapter provenance rather than rebuilding Markdown syntax behavior.

## Implementation Status

As of 2026-05-06, the first ingestion slice is recorded in `docs/ingestion-implementation.md`. It establishes ingestion job primitives, connector/extractor ports, local file ingestion, plain text normalization, Markitect markdown adapter boundaries, directory partial-result reporting, and in-memory/SQLite job persistence.
It now also includes explicit ingestion identity policy, content-digest identity for governed file move/rename reconciliation, unchanged-source skip behavior, and directory item retry/skipped reporting. CSV/TSV datasets now produce structured normalized table output, and PDF/office-like files can enter the governed asset set through metadata-only placeholder extraction with explicit unsupported-depth diagnostics. Normalized structure now covers text lines/paragraphs/links, CSV/TSV schemas/tables/sample rows/links, Markitect-provided Markdown blocks/headings/sections/link tokens, and document placeholder unsupported elements. Remaining work is focused on async execution and optional deep non-text extraction adapters, which are deferred to adjacent workplans. This WP-0006 foundation slice is complete.

## I6.1 - Implement ingestion job model status and retry surface

```task
id: KONT-WP-0006-T001
status: done
priority: high
state_hub_task_id: "8e5e514a-6eef-42d9-a93c-2458b4c82753"
```

Define ingestion jobs that support queued, running, completed, failed, partially completed, retried, quarantined, and canceled states.

Acceptance:

- Ingestion requests return job IDs and correlation IDs.
- Job status exposes input, actor, source reference, output assets, failures, retry options, and partial results.
- Failed ingestion does not silently enter the trusted asset set.

## I6.2 - Implement connector and extractor contracts

```task
id: KONT-WP-0006-T002
status: done
priority: high
state_hub_task_id: "3eafdab5-478d-49d9-a17f-3cd7c8847cb1"
```

Define source connector and format extractor protocols that can provide source references, metadata, permission context, content streams, and normalized outputs.

Acceptance:

- Connectors can describe capabilities and supported source types.
- Extractors can describe supported media types and extraction depth.
- External extraction results can be accepted with provenance.
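
The extractor contract described above can be sketched with a structural `typing.Protocol`. This is a minimal illustration under stated assumptions, not the engine's actual port definitions: `ExtractorCapabilities`, `ExtractionResult`, `FormatExtractor`, `PlainTextExtractor`, `select_extractor`, and the depth labels are all hypothetical names introduced here.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class ExtractorCapabilities:
    """Hypothetical capability record; field names are illustrative."""

    media_types: frozenset[str]
    depth: str  # assumed labels, e.g. "full" or "metadata_only"


@dataclass
class ExtractionResult:
    """Hypothetical result carrying normalized text plus provenance."""

    text: str
    provenance: dict[str, str] = field(default_factory=dict)


@runtime_checkable
class FormatExtractor(Protocol):
    """Assumed shape of an extractor port: describe, then extract."""

    def capabilities(self) -> ExtractorCapabilities: ...

    def extract(self, content: bytes, source_ref: str) -> ExtractionResult: ...


class PlainTextExtractor:
    """Example adapter; satisfies the protocol without inheriting from it."""

    def capabilities(self) -> ExtractorCapabilities:
        return ExtractorCapabilities(frozenset({"text/plain"}), "full")

    def extract(self, content: bytes, source_ref: str) -> ExtractionResult:
        return ExtractionResult(
            text=content.decode("utf-8"),
            provenance={"source_ref": source_ref, "extractor": "plain-text"},
        )


def select_extractor(
    extractors: list[FormatExtractor], media_type: str
) -> FormatExtractor | None:
    """Return the first extractor that declares support for media_type."""
    for extractor in extractors:
        if media_type in extractor.capabilities().media_types:
            return extractor
    return None
```

Using a structural protocol keeps adapters free of any import from the domain core's base classes, which is one way to respect the adapter boundary stated in the architecture constraint.
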

## I6.3 - Implement local file and directory ingestion

```task
id: KONT-WP-0006-T003
status: done
priority: high
state_hub_task_id: "d3e3d4d2-a581-4438-bee7-6fc4161d3925"
```

Create the first concrete source connector for local files and directories.

Acceptance:

- Local files can be ingested as source-referenced knowledge assets.
- Directory ingestion reports per-file success, skip, failure, and retry state.
- File path changes can be represented without changing stable asset identity when identity policy permits.

Implemented:

- `IngestionIdentityPolicy.SOURCE_LOCATION` remains the conservative default.
- `IngestionIdentityPolicy.CONTENT_DIGEST` preserves asset identity across file moves or renames when the caller opts into content identity.
- Existing assets receive a versioned `asset.ingest.update` record with new source references and representations.
- Re-ingesting an unchanged source is reported as a skipped child item without creating another asset version.

## I6.4 - Implement text and markdown normalization via markitect-tool adapter

```task
id: KONT-WP-0006-T004
status: done
priority: high
state_hub_task_id: "63bf2f7e-705d-40ae-a160-75fc508ffb1f"
```

Normalize plain text directly and markdown through `markitect-tool` adapter boundaries, without reimplementing markdown syntax primitives here.

Acceptance:

- Plain text produces normalized text representation and source provenance.
- Markdown extraction delegates to `markitect-tool` when available.
- Missing adapter dependencies fail with structured adapter errors.
- Parser, selector extraction, and snapshot identity behavior are covered by the Markitect integration contract tests.

Implemented:

- Plain text normalization produces source-grounded normalized representations.
- Markdown normalization imports and calls `markitect-tool` only inside the adapter boundary.
- Missing `markitect-tool` raises structured `AdapterUnavailableError` diagnostics.
- Adapter unit tests verify delegation and missing-dependency behavior.
- Optional contract tests verify parser, selector extraction, operations, snapshot identity, context packages, contracts, and schema behavior against the local `markitect-tool` checkout when available.

## I6.5 - Implement PDF office document and dataset baseline adapters

```task
id: KONT-WP-0006-T005
status: done
priority: high
state_hub_task_id: "04d7c4b0-abfd-4b14-892f-91d1c1a820cd"
```

Provide baseline ingestion adapters for PDFs, office-like documents, and structured datasets using optional dependencies or adapter stubs with explicit capability reporting.

Acceptance:

- Baseline formats can be represented as knowledge assets.
- Unsupported extraction depth is reported explicitly.
- CSV or table-like datasets produce structured normalized output.

Implemented:

- `CsvDatasetExtractor` supports CSV and TSV sources with structured columns, row counts, table metadata, and normalized dataset fields.
- `DocumentPlaceholderExtractor` supports PDF and common office media types as metadata-only assets with `extraction.depth_unsupported` diagnostics.
- Local file media-type detection is explicit for CSV, TSV, PDF, DOC/DOCX, XLS/XLSX, and PPT/PPTX.

## I6.6 - Extract structural elements into common normalized representation

```task
id: KONT-WP-0006-T006
status: done
priority: medium
state_hub_task_id: "7421bc87-d962-4938-9aa3-591f8489e542"
```

Represent titles, sections, headings, paragraphs, tables, links, embedded references, fields, and confidence signals where extractors can recover them.

Acceptance:

- Normalized representation supports text, structure, tables, links, and extractor metadata.
- Structural output can feed search, snippets, transformations, and context packages.
- Extractor confidence and unsupported elements are visible.
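
As an illustration of the acceptance criteria above, the following is a minimal sketch of a normalized representation for plain text. It is not the engine's `NormalizedDocument` contract: `NormalizedSketch`, its field names, and `normalize_plain_text` are hypothetical, and the URL pattern is deliberately naive (it keeps any trailing punctuation attached to a URL).

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field

# Naive URL matcher for illustration only; it stops at whitespace.
URL_PATTERN = re.compile(r"https?://\S+")


@dataclass
class NormalizedSketch:
    """Hypothetical container mirroring the categories named in this section."""

    text: str
    structure: list[dict] = field(default_factory=list)
    tables: list[dict] = field(default_factory=list)
    links: list[str] = field(default_factory=list)
    confidence: float = 1.0
    unsupported_elements: list[str] = field(default_factory=list)
    extractor: dict[str, str] = field(default_factory=dict)


def normalize_plain_text(text: str) -> NormalizedSketch:
    """Split text into paragraph units and collect URL links."""
    doc = NormalizedSketch(text=text, extractor={"name": "plain-text-sketch"})
    # Paragraphs are assumed to be separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        if block:
            doc.structure.append(
                {"kind": "paragraph", "index": len(doc.structure), "text": block}
            )
    doc.links = URL_PATTERN.findall(text)
    return doc
```

Keeping `confidence` and `unsupported_elements` on the same object as the text is one way to make extractor limits visible to downstream consumers such as search or context packaging.
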

Implemented:

- The shared `NormalizedDocument` contract already carries text, structure, tables, links, fields, confidence, unsupported elements, and extractor metadata.
- Plain text emits line and paragraph units plus URL links.
- CSV/TSV emits table schema, rows, sample rows, and URL links from cells.
- Markdown maps Markitect-provided blocks, headings, sections, table blocks, link tokens, and snapshot metadata without reimplementing Markdown parsing.
- PDF/office placeholders emit unsupported element structure and low-confidence metadata-only normalized output.

## I6.7 - Validate ingestion output quarantine failures and preserve provenance

```task
id: KONT-WP-0006-T007
status: done
priority: medium
state_hub_task_id: "07b32021-3701-437a-ae87-030bed56a25c"
```

Validate normalized content, required metadata, source provenance, permissions, and policy constraints before ingestion completion.

Acceptance:

- Invalid output is quarantined or failed with structured diagnostics.
- Re-ingestion preserves identity, provenance, permissions, versions, and relationships where policy allows.
- Batch ingestion reports succeeded, failed, skipped, quarantined, and retriable items separately.

Implemented:

- Ingestion validates extraction output before registry writes.
- Empty normalized output, missing/mismatched source checksum provenance, missing extractor metadata, denied source permission context, and low confidence without explicit unsupported elements quarantine the file job without adding a trusted asset.
- Source permission context is preserved on representations and metadata records for accepted assets.
- Directory ingestion reports succeeded, failed, skipped, quarantined, and retriable item counts separately.

## Definition Of Done

- Local file, text, markdown, PDF/document placeholder, and dataset ingestion scenarios are covered by tests.
- Job status and provenance are inspectable through programmatic APIs.
- Connector and extractor boundaries follow `docs/architecture-blueprint.md`.
- `python3 -m pytest` passes.
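
The scenario coverage listed above could include tests shaped like the following sketch, which exercises the unchanged-source skip and content-digest identity behavior described earlier in this workplan. These are not the repository's actual tests: `InMemoryRegistrySketch`, its `ingest` method, and the `by_digest` flag are hypothetical stand-ins for the real registry and identity policy.

```python
import hashlib


class InMemoryRegistrySketch:
    """Hypothetical stand-in for an asset registry keyed by asset ID."""

    def __init__(self) -> None:
        self.assets: dict[str, dict] = {}

    def ingest(self, path: str, content: bytes, by_digest: bool = False) -> str:
        """Return "created", "updated", or "skipped" for one source file."""
        digest = hashlib.sha256(content).hexdigest()
        for asset in self.assets.values():
            same_location = asset["path"] == path
            same_content = asset["digest"] == digest
            if same_location and same_content:
                return "skipped"  # unchanged source: no new version
            if same_location or (by_digest and same_content):
                # Same asset identity; record a new version and source path.
                asset.update(path=path, digest=digest, version=asset["version"] + 1)
                return "updated"
        asset_id = f"asset-{len(self.assets) + 1}"
        self.assets[asset_id] = {"path": path, "digest": digest, "version": 1}
        return "created"


def test_unchanged_source_is_skipped() -> None:
    registry = InMemoryRegistrySketch()
    assert registry.ingest("notes.txt", b"hello") == "created"
    assert registry.ingest("notes.txt", b"hello") == "skipped"
    assert registry.assets["asset-1"]["version"] == 1


def test_move_keeps_identity_when_content_digest_is_used() -> None:
    registry = InMemoryRegistrySketch()
    registry.ingest("old/notes.txt", b"hello")
    assert registry.ingest("new/notes.txt", b"hello", by_digest=True) == "updated"
    assert len(registry.assets) == 1
    assert registry.assets["asset-1"]["path"] == "new/notes.txt"
```

Both functions follow the `test_` naming convention, so `python3 -m pytest` would collect them if they lived in a test module.
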