kontextual-engine/workplans/KONT-WP-0006-multi-format-ingestion-normalization.md

id: KONT-WP-0006
type: workplan
title: Multi-Format Ingestion And Normalization
domain: markitect
repo: kontextual-engine
status: active
owner: codex
topic_slug: markitect
planning_priority: high
planning_order: 6
created: 2026-05-05
updated: 2026-05-06
state_hub_workstream_id: "270c83c0-eaed-4143-99d0-bb3fcfd23758"

KONT-WP-0006: Multi-Format Ingestion And Normalization

Purpose

Implement ingestion as an observable, retryable, provenance-preserving job system that can bring heterogeneous information assets into the engine and normalize them into a common representation for retrieval, metadata, relationships, transformations, workflows, and agent context.

Requirement Coverage

Primary: FR-020 to FR-030.

Supporting: FR-001 to FR-008, FR-022 to FR-028, FR-200 to FR-202, FR-240 to FR-244.

Architecture Constraint

Implement ingestion through connector and extractor ports described in docs/architecture-blueprint.md. Format-specific parsing, local filesystem access, markitect-tool, PDF/document libraries, and dataset readers must live behind adapters, not in the domain core.

markitect-tool Boundary Remark

Markdown ingestion must use markitect-tool for Markdown parsing, frontmatter, headings, sections, selectors, includes, contract checks where needed, and snapshot identity. The engine should normalize Markitect results into its common representation and preserve source/adapter provenance rather than rebuilding Markdown syntax behavior.

Implementation Status

As of 2026-05-06, the first ingestion slice is recorded in docs/ingestion-implementation.md. It establishes ingestion job primitives, connector/extractor ports, local file ingestion, plain text normalization, Markitect Markdown adapter boundaries, directory partial-result reporting, and in-memory/SQLite job persistence. Remaining work focuses on async execution, re-ingestion identity reconciliation, richer structural extraction, quarantine policy checks, and non-text format adapters.

I6.1 - Implement ingestion job model, status, and retry surface

id: KONT-WP-0006-T001
status: done
priority: high
state_hub_task_id: "8e5e514a-6eef-42d9-a93c-2458b4c82753"

Define ingestion jobs that support queued, running, completed, failed, partially completed, retried, quarantined, and canceled states.

Acceptance:

  • Ingestion requests return job IDs and correlation IDs.
  • Job status exposes input, actor, source reference, output assets, failures, retry options, and partial results.
  • Failed ingestion does not silently enter the trusted asset set.
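
The job states and the "no silent trust" rule above could be modeled roughly as follows. This is a minimal sketch, not the engine's actual model: `JobState`, `IngestionJob`, `TRUSTED_STATES`, and the choice to trust only fully completed jobs are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
import uuid


class JobState(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    PARTIALLY_COMPLETED = "partially_completed"
    RETRIED = "retried"
    QUARANTINED = "quarantined"
    CANCELED = "canceled"


# Assumption: only fully completed jobs contribute to the trusted asset set.
TRUSTED_STATES = {JobState.COMPLETED}


@dataclass
class IngestionJob:
    source_ref: str
    actor: str
    # Every request gets a job ID and a separate correlation ID.
    job_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    correlation_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    state: JobState = JobState.QUEUED
    output_assets: list[str] = field(default_factory=list)
    failures: list[str] = field(default_factory=list)

    def trusted_assets(self) -> list[str]:
        """Return output assets only when the job state is trusted."""
        return list(self.output_assets) if self.state in TRUSTED_STATES else []

    def can_retry(self) -> bool:
        """Hypothetical retry policy: failed or partial jobs may be retried."""
        return self.state in {JobState.FAILED, JobState.PARTIALLY_COMPLETED}
```

Keeping `trusted_assets()` separate from `output_assets` means a failed job can still expose partial results for inspection without those results being treated as trusted.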

I6.2 - Implement connector and extractor contracts

id: KONT-WP-0006-T002
status: done
priority: high
state_hub_task_id: "3eafdab5-478d-49d9-a17f-3cd7c8847cb1"

Define source connector and format extractor protocols that can provide source references, metadata, permission context, content streams, and normalized outputs.

Acceptance:

  • Connectors can describe capabilities and supported source types.
  • Extractors can describe supported media types and extraction depth.
  • External extraction results can be accepted with provenance.

I6.3 - Implement local file and directory ingestion

id: KONT-WP-0006-T003
status: in_progress
priority: high
state_hub_task_id: "d3e3d4d2-a581-4438-bee7-6fc4161d3925"

Create the first concrete source connector for local files and directories.

Acceptance:

  • Local files can be ingested as source-referenced knowledge assets.
  • Directory ingestion reports per-file success, skip, failure, and retry state.
  • File path changes can be represented without changing stable asset identity when identity policy permits.
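
Per-file directory reporting could look something like the sketch below. The `DirectoryReport` shape, the suffix allow-list, and the rule that only I/O errors are retriable are assumptions for illustration; the real connector may classify results differently.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class DirectoryReport:
    succeeded: list[str] = field(default_factory=list)
    skipped: list[str] = field(default_factory=list)
    failed: dict[str, str] = field(default_factory=dict)  # path -> error name
    retriable: list[str] = field(default_factory=list)


def ingest_directory(
    root: Path, allowed_suffixes: frozenset = frozenset({".txt", ".md"})
) -> DirectoryReport:
    """Walk a directory and record one outcome per file instead of failing fast."""
    report = DirectoryReport()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = path.relative_to(root).as_posix()
        if path.suffix.lower() not in allowed_suffixes:
            report.skipped.append(rel)
            continue
        try:
            path.read_text(encoding="utf-8")
        except OSError as exc:
            # Assumption: I/O problems may be transient, so mark them retriable.
            report.failed[rel] = type(exc).__name__
            report.retriable.append(rel)
        except UnicodeDecodeError as exc:
            # Undecodable content will not improve on retry.
            report.failed[rel] = type(exc).__name__
        else:
            report.succeeded.append(rel)
    return report
```

One unreadable file does not abort the run; it ends up in `failed` while the remaining files are still processed.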

I6.4 - Implement text and markdown normalization via markitect-tool adapter

id: KONT-WP-0006-T004
status: in_progress
priority: high
state_hub_task_id: "63bf2f7e-705d-40ae-a160-75fc508ffb1f"

Normalize plain text directly and Markdown through markitect-tool adapter boundaries, without reimplementing Markdown syntax primitives in the engine.

Acceptance:

  • Plain text produces normalized text representation and source provenance.
  • Markdown extraction delegates to markitect-tool when available.
  • Missing adapter dependencies fail with structured adapter errors.
  • Parser, selector extraction, and snapshot identity behavior are covered by the Markitect integration contract tests.
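
The plain text path and the structured failure for a missing adapter dependency might be sketched as below. `AdapterError`, its fields, the `missing_dependency` code, and the line-ending normalization are assumptions; the sketch deliberately does not call markitect-tool, whose import name and API are not stated here.

```python
import importlib.util


class AdapterError(Exception):
    """Hypothetical structured error carrying machine-readable fields."""

    def __init__(self, adapter: str, code: str, detail: str):
        super().__init__(f"{adapter}: {code}: {detail}")
        self.adapter = adapter
        self.code = code
        self.detail = detail


def normalize_plain_text(raw: bytes, source_ref: str) -> dict:
    """Decode as UTF-8 and unify line endings; keep source provenance alongside."""
    text = raw.decode("utf-8").replace("\r\n", "\n").replace("\r", "\n")
    return {
        "text": text,
        "provenance": {"source_ref": source_ref, "adapter": "plain-text"},
    }


def require_adapter_module(adapter: str, module_name: str) -> None:
    """Raise a structured error when a top-level optional module is not installed."""
    if importlib.util.find_spec(module_name) is None:
        raise AdapterError(
            adapter, "missing_dependency", f"module {module_name!r} is not installed"
        )
```

A Markdown adapter could call `require_adapter_module` with whatever module name markitect-tool actually exposes before delegating to it, so callers receive an error code rather than a bare `ImportError`.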

I6.5 - Implement PDF, office document, and dataset baseline adapters

id: KONT-WP-0006-T005
status: todo
priority: high
state_hub_task_id: "04d7c4b0-abfd-4b14-892f-91d1c1a820cd"

Provide baseline ingestion adapters for PDFs, office-like documents, and structured datasets using optional dependencies or adapter stubs with explicit capability reporting.

Acceptance:

  • Baseline formats can be represented as knowledge assets.
  • Unsupported extraction depth is reported explicitly.
  • CSV or table-like datasets produce structured normalized output.
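
For the dataset case, a baseline adapter could rely on the standard library `csv` module and report depth explicitly for formats it cannot parse yet. The capability table, the depth labels, and the output shape below are assumptions chosen for illustration.

```python
import csv
import io

# Hypothetical capability table: media type -> extraction depth.
CAPABILITIES = {
    "text/csv": "structured",
    "application/pdf": "unsupported",
}


def describe_depth(media_type: str) -> str:
    """Report extraction depth explicitly; unknown types are unsupported."""
    return CAPABILITIES.get(media_type, "unsupported")


def extract_csv(raw: str) -> dict:
    """Turn CSV text into a table with named columns and one dict per row."""
    rows = list(csv.reader(io.StringIO(raw, newline="")))
    header, body = (rows[0], rows[1:]) if rows else ([], [])
    return {
        "kind": "table",
        "columns": header,
        "rows": [dict(zip(header, row)) for row in body],
    }
```

For example, `extract_csv("name,qty\nbolt,3\n")` yields the columns `["name", "qty"]` and a single row `{"name": "bolt", "qty": "3"}`. Values stay as strings, since type inference is left out of this sketch.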

I6.6 - Extract structural elements into common normalized representation

id: KONT-WP-0006-T006
status: todo
priority: medium
state_hub_task_id: "7421bc87-d962-4938-9aa3-591f8489e542"

Represent titles, sections, headings, paragraphs, tables, links, embedded references, fields, and confidence signals where extractors can recover them.

Acceptance:

  • Normalized representation supports text, structure, tables, links, and extractor metadata.
  • Structural output can feed search, snippets, transformations, and context packages.
  • Extractor confidence and unsupported elements are visible.
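
A common normalized representation along these lines might be a small set of dataclasses. The names `Element` and `NormalizedDocument`, the element kinds, and the helper methods are assumptions, intended only to show how text, structure, confidence, and unsupported elements could live in one shape.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Element:
    kind: str  # e.g. "heading", "paragraph", "table", "link"
    text: str
    level: Optional[int] = None  # heading depth, when known
    confidence: Optional[float] = None  # extractor confidence, when reported


@dataclass
class NormalizedDocument:
    title: Optional[str]
    elements: list[Element] = field(default_factory=list)
    unsupported: list[str] = field(default_factory=list)
    extractor: dict[str, str] = field(default_factory=dict)

    def plain_text(self) -> str:
        """Flatten headings and paragraphs for search or snippets."""
        kinds = {"heading", "paragraph"}
        return "\n".join(e.text for e in self.elements if e.kind in kinds)

    def outline(self) -> list[tuple[int, str]]:
        """Return (level, text) pairs for headings; a missing level defaults to 1."""
        return [(e.level or 1, e.text) for e in self.elements if e.kind == "heading"]
```

Listing unsupported elements by name, rather than dropping them, keeps gaps in extraction visible to downstream consumers.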

I6.7 - Validate ingestion output, quarantine failures, and preserve provenance

id: KONT-WP-0006-T007
status: todo
priority: medium
state_hub_task_id: "07b32021-3701-437a-ae87-030bed56a25c"

Validate normalized content, required metadata, source provenance, permissions, and policy constraints before ingestion completion.

Acceptance:

  • Invalid output is quarantined or failed with structured diagnostics.
  • Re-ingestion preserves identity, provenance, permissions, versions, and relationships where policy allows.
  • Batch ingestion reports succeeded, failed, skipped, quarantined, and retriable items separately.
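
The validation gate and the separate batch buckets could be sketched as follows. The required field names, the diagnostic format, and the decision to quarantine rather than fail are assumptions; real policy and permission checks would add further rules.

```python
from collections import Counter

# Hypothetical minimum fields a normalized item must carry.
REQUIRED_FIELDS = ("text", "source_ref", "media_type")


def validate_output(item: dict) -> list[str]:
    """Return structured diagnostics; an empty list means the item is valid."""
    return [f"missing:{name}" for name in REQUIRED_FIELDS if not item.get(name)]


def classify(item: dict) -> tuple[str, list[str]]:
    """Quarantine invalid items instead of letting them complete."""
    diagnostics = validate_output(item)
    return ("quarantined", diagnostics) if diagnostics else ("succeeded", [])


def summarize(items: list[dict]) -> Counter:
    """Count outcomes per bucket so a batch report keeps them separate."""
    return Counter(classify(item)[0] for item in items)
```

A fuller version would add failed, skipped, and retriable buckets in the same way, with each item landing in exactly one of them.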

Definition Of Done

  • Local file, text, Markdown, PDF/document placeholder, and dataset ingestion scenarios are covered by tests.
  • Job status and provenance are inspectable through programmatic APIs.
  • Connector and extractor boundaries follow docs/architecture-blueprint.md.
  • python3 -m pytest passes.