| id | type | title | domain | repo | status | owner | topic_slug | planning_priority | planning_order | created | updated | state_hub_workstream_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| KONT-WP-0006 | workplan | Multi-Format Ingestion And Normalization | markitect | kontextual-engine | done | codex | markitect | high | 6 | 2026-05-05 | 2026-05-06 | 270c83c0-eaed-4143-99d0-bb3fcfd23758 |
KONT-WP-0006: Multi-Format Ingestion And Normalization
Purpose
Implement ingestion as an observable, retryable, provenance-preserving job system that can bring heterogeneous information assets into the engine and normalize them into a common representation for retrieval, metadata, relationships, transformations, workflows, and agent context.
Requirement Coverage
Primary: FR-020 to FR-030.
Supporting: FR-001 to FR-008, FR-022 to FR-028, FR-200 to FR-202, FR-240 to FR-244.
Architecture Constraint
Implement ingestion through connector and extractor ports described in
docs/architecture-blueprint.md. Format-specific parsing, local filesystem
access, markitect-tool, PDF/document libraries, and dataset readers must live
behind adapters, not in the domain core.
markitect-tool Boundary Remark
Markdown ingestion must use markitect-tool for Markdown parsing,
frontmatter, headings, sections, selectors, includes, contract checks where
needed, and snapshot identity. The engine should normalize Markitect results
into its common representation and preserve source/adapter provenance rather
than rebuilding Markdown syntax behavior.
Implementation Status
As of 2026-05-06, the first ingestion slice is recorded in
docs/ingestion-implementation.md. It establishes ingestion job primitives,
connector/extractor ports, local file ingestion, plain text normalization,
Markitect markdown adapter boundaries, directory partial-result reporting, and
in-memory/SQLite job persistence. It now also includes explicit ingestion
identity policy, content-digest identity for governed file move/rename
reconciliation, unchanged-source skip behavior, and directory item retry/skipped
reporting. CSV/TSV datasets now produce structured normalized table output, and
PDF/office-like files can enter the governed asset set through metadata-only
placeholder extraction with explicit unsupported-depth diagnostics. Normalized
structure now covers text lines/paragraphs/links, CSV/TSV schemas/tables/sample
rows/links, Markitect-provided Markdown blocks/headings/sections/link tokens,
and document placeholder unsupported elements. Remaining work is focused on
async execution and optional deep non-text extraction adapters, which are
deferred to adjacent workplans. This WP-0006 foundation slice is complete.
I6.1 - Implement ingestion job model, status, and retry surface
id: KONT-WP-0006-T001
status: done
priority: high
state_hub_task_id: "8e5e514a-6eef-42d9-a93c-2458b4c82753"
Define ingestion jobs that support queued, running, completed, failed, partially completed, retried, quarantined, and canceled states.
Acceptance:
- Ingestion requests return job IDs and correlation IDs.
- Job status exposes input, actor, source reference, output assets, failures, retry options, and partial results.
- Failed ingestion does not silently enter the trusted asset set.
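The job states and acceptance criteria above could be modeled roughly as follows. This is a minimal sketch, not the engine's implementation: the names `JobState`, `ALLOWED_TRANSITIONS`, and `IngestionJob`, and the specific transition table, are assumptions made for illustration. Only the list of states comes from this workplan.

```python
from dataclasses import dataclass, field
from enum import Enum
from uuid import uuid4


class JobState(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    PARTIALLY_COMPLETED = "partially_completed"
    RETRIED = "retried"
    QUARANTINED = "quarantined"
    CANCELED = "canceled"


# Hypothetical transition table; the real engine may permit a different set.
ALLOWED_TRANSITIONS = {
    JobState.QUEUED: {JobState.RUNNING, JobState.CANCELED},
    JobState.RUNNING: {
        JobState.COMPLETED,
        JobState.FAILED,
        JobState.PARTIALLY_COMPLETED,
        JobState.QUARANTINED,
        JobState.CANCELED,
    },
    JobState.FAILED: {JobState.RETRIED},
    JobState.PARTIALLY_COMPLETED: {JobState.RETRIED},
    JobState.RETRIED: {JobState.RUNNING, JobState.CANCELED},
    JobState.COMPLETED: set(),
    JobState.QUARANTINED: set(),
    JobState.CANCELED: set(),
}


@dataclass
class IngestionJob:
    source_ref: str
    actor: str
    # Every request gets both a job ID and a correlation ID.
    job_id: str = field(default_factory=lambda: uuid4().hex)
    correlation_id: str = field(default_factory=lambda: uuid4().hex)
    state: JobState = JobState.QUEUED
    output_assets: list[str] = field(default_factory=list)
    failures: list[str] = field(default_factory=list)

    def transition(self, new_state: JobState) -> None:
        """Move to new_state, rejecting transitions the table does not list."""
        if new_state not in ALLOWED_TRANSITIONS[self.state]:
            raise ValueError(
                f"illegal transition {self.state.value} -> {new_state.value}"
            )
        self.state = new_state

    @property
    def retry_allowed(self) -> bool:
        """True when the current state may move to RETRIED."""
        return JobState.RETRIED in ALLOWED_TRANSITIONS[self.state]
```

Keeping the transitions in one table makes the "failed ingestion does not silently enter the trusted asset set" rule easy to audit: in this sketch, a failed job can only be retried and can never move directly to completed.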
I6.2 - Implement connector and extractor contracts
id: KONT-WP-0006-T002
status: done
priority: high
state_hub_task_id: "3eafdab5-478d-49d9-a17f-3cd7c8847cb1"
Define source connector and format extractor protocols that can provide source references, metadata, permission context, content streams, and normalized outputs.
Acceptance:
- Connectors can describe capabilities and supported source types.
- Extractors can describe supported media types and extraction depth.
- External extraction results can be accepted with provenance.
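One way to express the connector and extractor contracts is with structural `typing.Protocol` classes. The sketch below is illustrative only: `SourceItem`, `ExtractionResult`, `SourceConnector`, `FormatExtractor`, `PlainTextExtractor`, and `select_extractor` are hypothetical names, and the actual ports are defined in docs/architecture-blueprint.md.

```python
from dataclasses import dataclass, field
from typing import Iterable, Optional, Protocol, runtime_checkable


@dataclass(frozen=True)
class SourceItem:
    source_ref: str
    media_type: str
    content: bytes
    permission_context: dict = field(default_factory=dict)


@dataclass(frozen=True)
class ExtractionResult:
    text: str
    extractor: str
    depth: str  # assumed vocabulary, e.g. "full" or "metadata_only"
    provenance: dict = field(default_factory=dict)


@runtime_checkable
class SourceConnector(Protocol):
    def supported_source_types(self) -> frozenset: ...
    def fetch(self, source_ref: str) -> Iterable[SourceItem]: ...


@runtime_checkable
class FormatExtractor(Protocol):
    def supported_media_types(self) -> frozenset: ...
    def extraction_depth(self, media_type: str) -> str: ...
    def extract(self, item: SourceItem) -> ExtractionResult: ...


class PlainTextExtractor:
    """Hypothetical extractor that satisfies FormatExtractor structurally."""

    def supported_media_types(self) -> frozenset:
        return frozenset({"text/plain"})

    def extraction_depth(self, media_type: str) -> str:
        return "full" if media_type in self.supported_media_types() else "unsupported"

    def extract(self, item: SourceItem) -> ExtractionResult:
        return ExtractionResult(
            text=item.content.decode("utf-8"),
            extractor="plain-text",
            depth="full",
            provenance={"source_ref": item.source_ref},
        )


def select_extractor(
    extractors: Iterable[FormatExtractor], media_type: str
) -> Optional[FormatExtractor]:
    """Return the first extractor that declares support for media_type."""
    for extractor in extractors:
        if media_type in extractor.supported_media_types():
            return extractor
    return None
```

Because the protocols are structural, an adapter does not need to inherit from anything in the domain core. It only has to expose the declared methods.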
I6.3 - Implement local file and directory ingestion
id: KONT-WP-0006-T003
status: done
priority: high
state_hub_task_id: "d3e3d4d2-a581-4438-bee7-6fc4161d3925"
Create the first concrete source connector for local files and directories.
Acceptance:
- Local files can be ingested as source-referenced knowledge assets.
- Directory ingestion reports per-file success, skip, failure, and retry state.
- File path changes can be represented without changing stable asset identity when identity policy permits.
Implemented:
- IngestionIdentityPolicy.SOURCE_LOCATION remains the conservative default.
- IngestionIdentityPolicy.CONTENT_DIGEST preserves asset identity across file moves or renames when the caller opts into content identity.
- Existing assets receive a versioned asset.ingest.update record with new source references and representations.
- Re-ingesting an unchanged source is reported as a skipped child item without creating another asset version.
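The difference between the two identity policies can be sketched as below. The enum member names come from this workplan. The member values, the `identity_key` and `reconcile` helpers, the dict-based registry, and the use of SHA-256 are assumptions made for illustration.

```python
import hashlib
from enum import Enum


class IngestionIdentityPolicy(Enum):
    SOURCE_LOCATION = "source_location"  # member values are assumed
    CONTENT_DIGEST = "content_digest"


def identity_key(policy: IngestionIdentityPolicy, source_ref: str, content: bytes) -> str:
    """Derive a stable asset key from either the location or the content."""
    if policy is IngestionIdentityPolicy.SOURCE_LOCATION:
        basis = source_ref.encode("utf-8")
    else:
        basis = content
    return f"{policy.value}:{hashlib.sha256(basis).hexdigest()}"


def reconcile(registry: dict, policy: IngestionIdentityPolicy,
              source_ref: str, content: bytes) -> str:
    """Return 'created', 'updated', or 'skipped' for one ingested file."""
    key = identity_key(policy, source_ref, content)
    digest = hashlib.sha256(content).hexdigest()
    existing = registry.get(key)
    if existing is None:
        registry[key] = {"source_ref": source_ref, "digest": digest, "version": 1}
        return "created"
    if existing["source_ref"] == source_ref and existing["digest"] == digest:
        # Unchanged source: report a skip, do not create another version.
        return "skipped"
    # Same identity, new location or content: record a new version.
    existing.update(source_ref=source_ref, digest=digest,
                    version=existing["version"] + 1)
    return "updated"
```

Under `CONTENT_DIGEST`, ingesting the same bytes from a renamed path resolves to the same key and yields an update. Under `SOURCE_LOCATION`, the renamed path produces a second asset.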
I6.4 - Implement text and markdown normalization via markitect-tool adapter
id: KONT-WP-0006-T004
status: done
priority: high
state_hub_task_id: "63bf2f7e-705d-40ae-a160-75fc508ffb1f"
Normalize plain text directly and markdown through markitect-tool adapter
boundaries, without reimplementing markdown syntax primitives here.
Acceptance:
- Plain text produces normalized text representation and source provenance.
- Markdown extraction delegates to markitect-tool when available.
- Missing adapter dependencies fail with structured adapter errors.
- Parser, selector extraction, and snapshot identity behavior are covered by the Markitect integration contract tests.
Implemented:
- Plain text normalization produces source-grounded normalized representations.
- Markdown normalization imports and calls markitect-tool only inside the adapter boundary.
- Missing markitect-tool raises structured AdapterUnavailableError diagnostics.
- Adapter unit tests verify delegation and missing-dependency behavior.
- Optional contract tests verify parser, selector extraction, operations, snapshot identity, context packages, contracts, and schema behavior against the local markitect-tool checkout when available.
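The missing-dependency behavior described above can be illustrated with a lazy import inside the adapter boundary. This is a sketch under stated assumptions: the error class name comes from this workplan, but its constructor, the `diagnostic` payload, the `adapter.unavailable` code, and the `load_adapter_dependency` helper are hypothetical. The importable module name of markitect-tool is not stated here, so the sketch takes it as a parameter.

```python
import importlib


class AdapterUnavailableError(RuntimeError):
    """Structured error raised when an optional adapter dependency is missing."""

    def __init__(self, adapter: str, dependency: str, hint: str):
        super().__init__(f"{adapter}: missing dependency {dependency!r}")
        # Hypothetical diagnostic shape; the engine's real fields may differ.
        self.diagnostic = {
            "code": "adapter.unavailable",
            "adapter": adapter,
            "dependency": dependency,
            "hint": hint,
        }


def load_adapter_dependency(adapter: str, module_name: str):
    """Import the dependency lazily so the domain core never imports it."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise AdapterUnavailableError(
            adapter, module_name, hint="install the optional dependency"
        ) from exc
```

Deferring the import to call time keeps the rest of the engine importable when the optional dependency is absent, and callers receive a structured diagnostic instead of a bare `ImportError`.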
I6.5 - Implement PDF, office document, and dataset baseline adapters
id: KONT-WP-0006-T005
status: done
priority: high
state_hub_task_id: "04d7c4b0-abfd-4b14-892f-91d1c1a820cd"
Provide baseline ingestion adapters for PDFs, office-like documents, and structured datasets using optional dependencies or adapter stubs with explicit capability reporting.
Acceptance:
- Baseline formats can be represented as knowledge assets.
- Unsupported extraction depth is reported explicitly.
- CSV or table-like datasets produce structured normalized output.
Implemented:
- CsvDatasetExtractor supports CSV and TSV sources with structured columns, row counts, table metadata, and normalized dataset fields.
- DocumentPlaceholderExtractor supports PDF and common office media types as metadata-only assets with extraction.depth_unsupported diagnostics.
- Local file media-type detection is explicit for CSV, TSV, PDF, DOC/DOCX, XLS/XLSX, and PPT/PPTX.
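The two baseline behaviors above, structured table output for CSV/TSV and metadata-only placeholders for documents, can be sketched with the standard library. The function names and the returned dict shapes are assumptions, not the signatures of `CsvDatasetExtractor` or `DocumentPlaceholderExtractor`. Only the `extraction.depth_unsupported` diagnostic code is taken from this workplan.

```python
import csv
import io


def extract_delimited(text: str, media_type: str, sample_size: int = 3) -> dict:
    """Parse CSV or TSV text into columns, a row count, and sample rows."""
    delimiter = "\t" if media_type == "text/tab-separated-values" else ","
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    header, data = (rows[0], rows[1:]) if rows else ([], [])
    return {
        "columns": header,
        "row_count": len(data),
        "sample_rows": data[:sample_size],
        "diagnostics": [],
    }


def extract_placeholder(source_ref: str, media_type: str, size_bytes: int) -> dict:
    """Admit a document as a metadata-only asset and say so explicitly."""
    return {
        "text": "",
        "metadata": {
            "source_ref": source_ref,
            "media_type": media_type,
            "size_bytes": size_bytes,
        },
        "unsupported_elements": ["body"],  # hypothetical element label
        "confidence": 0.0,
        "diagnostics": [
            {"code": "extraction.depth_unsupported", "media_type": media_type}
        ],
    }
```

The placeholder never pretends to have extracted content. Downstream consumers can tell from the diagnostic and the zero confidence that only metadata is available.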
I6.6 - Extract structural elements into common normalized representation
id: KONT-WP-0006-T006
status: done
priority: medium
state_hub_task_id: "7421bc87-d962-4938-9aa3-591f8489e542"
Represent titles, sections, headings, paragraphs, tables, links, embedded references, fields, and confidence signals where extractors can recover them.
Acceptance:
- Normalized representation supports text, structure, tables, links, and extractor metadata.
- Structural output can feed search, snippets, transformations, and context packages.
- Extractor confidence and unsupported elements are visible.
Implemented:
- The shared NormalizedDocument contract already carries text, structure, tables, links, fields, confidence, unsupported elements, and extractor metadata.
- Plain text emits line and paragraph units plus URL links.
- CSV/TSV emits table schema, rows, sample rows, and URL links from cells.
- Markdown maps Markitect-provided blocks, headings, sections, table blocks, link tokens, and snapshot metadata without reimplementing Markdown parsing.
- PDF/office placeholders emit unsupported element structure and low-confidence metadata-only normalized output.
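As an illustration of the plain-text case above, the sketch below builds a simplified stand-in for the normalized contract and fills it with line units, paragraph units, and URL links. `NormalizedDocumentSketch` is deliberately not named `NormalizedDocument`: its field types, the unit dict shapes, and the URL pattern are assumptions, and only the list of carried concepts comes from this workplan.

```python
import re
from dataclasses import dataclass, field

# Simplified URL pattern, for illustration only.
URL_PATTERN = re.compile(r"https?://[^\s<>)\]]+")


@dataclass
class NormalizedDocumentSketch:
    text: str
    structure: list[dict] = field(default_factory=list)
    tables: list[dict] = field(default_factory=list)
    links: list[str] = field(default_factory=list)
    fields: dict = field(default_factory=dict)
    confidence: float = 1.0
    unsupported_elements: list[str] = field(default_factory=list)
    extractor_metadata: dict = field(default_factory=dict)


def normalize_plain_text(text: str) -> NormalizedDocumentSketch:
    """Emit line units, then paragraph units, plus any URLs found in the text."""
    structure = [
        {"kind": "line", "number": number, "text": line}
        for number, line in enumerate(text.splitlines(), start=1)
    ]
    # Paragraphs are runs of text separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    structure.extend(
        {"kind": "paragraph", "index": index, "text": paragraph}
        for index, paragraph in enumerate(paragraphs)
    )
    return NormalizedDocumentSketch(
        text=text,
        structure=structure,
        links=URL_PATTERN.findall(text),
        extractor_metadata={"extractor": "plain-text-sketch"},
    )
```

Carrying `confidence` and `unsupported_elements` on every document, even when they hold default values, lets search and context-package consumers treat all extractors uniformly.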
I6.7 - Validate ingestion output, quarantine failures, and preserve provenance
id: KONT-WP-0006-T007
status: done
priority: medium
state_hub_task_id: "07b32021-3701-437a-ae87-030bed56a25c"
Validate normalized content, required metadata, source provenance, permissions, and policy constraints before ingestion completion.
Acceptance:
- Invalid output is quarantined or failed with structured diagnostics.
- Re-ingestion preserves identity, provenance, permissions, versions, and relationships where policy allows.
- Batch ingestion reports succeeded, failed, skipped, quarantined, and retriable items separately.
Implemented:
- Ingestion validates extraction output before registry writes.
- The file job is quarantined, and no trusted asset is added, when validation finds any of the following: empty normalized output, missing or mismatched source checksum provenance, missing extractor metadata, denied source permission context, or low confidence without explicit unsupported elements.
- Source permission context is preserved on representations and metadata records for accepted assets.
- Directory ingestion reports succeeded, failed, skipped, quarantined, and retriable item counts separately.
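The validation gate and the per-item batch counts described above might look like the following. This is a minimal sketch: the result dict shape, the reason codes, the `min_confidence` threshold, and the function names are all hypothetical. It also assumes that output counts as empty only when it has no text, no tables, and no declared unsupported elements, so that metadata-only placeholders can still pass.

```python
from collections import Counter

BATCH_STATUSES = ("succeeded", "failed", "skipped", "quarantined", "retriable")


def quarantine_reasons(result: dict, expected_checksum: str,
                       min_confidence: float = 0.5) -> list:
    """Return reason codes; an empty list means the output may be registered."""
    reasons = []
    unsupported = result.get("unsupported_elements") or []
    if not result.get("text") and not result.get("tables") and not unsupported:
        reasons.append("output.empty")
    # A missing checksum compares unequal, so it is caught here as well.
    if (result.get("provenance") or {}).get("checksum") != expected_checksum:
        reasons.append("provenance.checksum_mismatch")
    if not result.get("extractor_metadata"):
        reasons.append("extractor.metadata_missing")
    if (result.get("permission_context") or {}).get("decision") == "deny":
        reasons.append("permission.denied")
    if result.get("confidence", 0.0) < min_confidence and not unsupported:
        reasons.append("confidence.low_unexplained")
    return reasons


def summarize_batch(statuses: list) -> dict:
    """Count each per-item status separately, including zero counts."""
    counts = Counter(statuses)
    return {name: counts.get(name, 0) for name in BATCH_STATUSES}
```

Running the gate before any registry write means a quarantined file never becomes a trusted asset, and returning every reason at once gives the job status a complete diagnostic list rather than only the first failure.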
Definition Of Done
- Local file, text, markdown, PDF/document placeholder, and dataset ingestion scenarios are covered by tests.
- Job status and provenance are inspectable through programmatic APIs.
- Connector and extractor boundaries follow docs/architecture-blueprint.md.
- python3 -m pytest passes.