---
id: KONT-WP-0006
type: workplan
title: "Multi-Format Ingestion And Normalization"
domain: markitect
repo: kontextual-engine
status: done
owner: codex
topic_slug: markitect
planning_priority: high
planning_order: 6
created: "2026-05-05"
updated: "2026-05-06"
state_hub_workstream_id: "270c83c0-eaed-4143-99d0-bb3fcfd23758"
---

# KONT-WP-0006: Multi-Format Ingestion And Normalization

## Purpose

Implement ingestion as an observable, retryable, provenance-preserving job
system that can bring heterogeneous information assets into the engine and
normalize them into a common representation for retrieval, metadata,
relationships, transformations, workflows, and agent context.

## Requirement Coverage

Primary: FR-020 to FR-030.

Supporting: FR-001 to FR-008, FR-022 to FR-028, FR-200 to FR-202,
FR-240 to FR-244.

## Architecture Constraint

Implement ingestion through connector and extractor ports described in
`docs/architecture-blueprint.md`. Format-specific parsing, local filesystem
access, `markitect-tool`, PDF/document libraries, and dataset readers must live
behind adapters, not in the domain core.

## markitect-tool Boundary Remark

Markdown ingestion must use `markitect-tool` for Markdown parsing,
frontmatter, headings, sections, selectors, includes, contract checks where
needed, and snapshot identity. The engine should normalize Markitect results
into its common representation and preserve source/adapter provenance rather
than rebuilding Markdown syntax behavior.

## Implementation Status

As of 2026-05-06, the first ingestion slice is recorded in
`docs/ingestion-implementation.md`. It establishes ingestion job primitives,
connector/extractor ports, local file ingestion, plain text normalization,
Markitect Markdown adapter boundaries, directory partial-result reporting, and
in-memory/SQLite job persistence.

The slice also includes an explicit ingestion identity policy, content-digest
identity for governed file move/rename reconciliation, unchanged-source skip
behavior, and directory item retry/skipped reporting. CSV/TSV datasets produce
structured normalized table output, and PDF/office-like files can enter the
governed asset set through metadata-only placeholder extraction with explicit
unsupported-depth diagnostics.

Normalized structure covers text lines/paragraphs/links, CSV/TSV
schemas/tables/sample rows/links, Markitect-provided Markdown
blocks/headings/sections/link tokens, and document placeholder unsupported
elements.

Remaining work is focused on async execution and optional deep non-text
extraction adapters, which are deferred to adjacent workplans. This WP-0006
foundation slice is complete.

## I6.1 - Implement ingestion job model status and retry surface

```task
id: KONT-WP-0006-T001
status: done
priority: high
state_hub_task_id: "8e5e514a-6eef-42d9-a93c-2458b4c82753"
```

Define ingestion jobs that support queued, running, completed, failed,
partially completed, retried, quarantined, and canceled states.

Acceptance:

- Ingestion requests return job IDs and correlation IDs.
- Job status exposes input, actor, source reference, output assets, failures,
  retry options, and partial results.
- Failed ingestion does not silently enter the trusted asset set.

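
The job states listed in this task can be sketched as a small state machine. This is a minimal illustration, not the engine's actual job model: `JobState`, `IngestionJob`, and the `ALLOWED_TRANSITIONS` table are hypothetical names, and the set of permitted transitions is an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum
from uuid import uuid4


class JobState(str, Enum):
    """States named in this task; the enum itself is a hypothetical sketch."""
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    PARTIALLY_COMPLETED = "partially_completed"
    RETRIED = "retried"
    QUARANTINED = "quarantined"
    CANCELED = "canceled"


# Assumed transition table; the real job model may allow different moves.
ALLOWED_TRANSITIONS = {
    JobState.QUEUED: {JobState.RUNNING, JobState.CANCELED},
    JobState.RUNNING: {
        JobState.COMPLETED,
        JobState.FAILED,
        JobState.PARTIALLY_COMPLETED,
        JobState.QUARANTINED,
        JobState.CANCELED,
    },
    JobState.FAILED: {JobState.RETRIED},
    JobState.PARTIALLY_COMPLETED: {JobState.RETRIED},
    JobState.RETRIED: {JobState.RUNNING, JobState.CANCELED},
}


@dataclass
class IngestionJob:
    source_reference: str
    actor: str
    # Every request gets a job ID and a correlation ID up front.
    job_id: str = field(default_factory=lambda: uuid4().hex)
    correlation_id: str = field(default_factory=lambda: uuid4().hex)
    state: JobState = JobState.QUEUED
    output_assets: list = field(default_factory=list)
    failures: list = field(default_factory=list)

    def transition(self, new_state: JobState) -> None:
        """Move to new_state, rejecting moves the table does not list."""
        allowed = ALLOWED_TRANSITIONS.get(self.state, set())
        if new_state not in allowed:
            raise ValueError(
                f"{self.state.value} -> {new_state.value} is not allowed"
            )
        self.state = new_state

    def retry_options(self) -> list:
        """Return ['retried'] when the current state can be retried, else []."""
        allowed = ALLOWED_TRANSITIONS.get(self.state, set())
        return [s.value for s in allowed if s is JobState.RETRIED]
```

Keeping the transitions in one table makes it straightforward to reject a failed job being marked completed without an explicit retry, which is one way to keep failed ingestion out of the trusted asset set.
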
## I6.2 - Implement connector and extractor contracts

```task
id: KONT-WP-0006-T002
status: done
priority: high
state_hub_task_id: "3eafdab5-478d-49d9-a17f-3cd7c8847cb1"
```

Define source connector and format extractor protocols that can provide source
references, metadata, permission context, content streams, and normalized
outputs.

Acceptance:

- Connectors can describe capabilities and supported source types.
- Extractors can describe supported media types and extraction depth.
- External extraction results can be accepted with provenance.

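
One way to express such contracts in Python is with `typing.Protocol`. The sketch below is illustrative only: `SourceConnector`, `FormatExtractor`, `ExtractorCapabilities`, the two toy implementations, and `select_extractor` are hypothetical names, and the depth labels are assumptions rather than the ports defined in `docs/architecture-blueprint.md`.

```python
import io
from dataclasses import dataclass
from typing import BinaryIO, Optional, Protocol, runtime_checkable


@dataclass(frozen=True)
class ExtractorCapabilities:
    media_types: frozenset
    depth: str  # assumed labels, e.g. "full" or "metadata_only"


@runtime_checkable
class SourceConnector(Protocol):
    def supported_source_types(self) -> frozenset: ...
    def open(self, source_reference: str) -> BinaryIO: ...


@runtime_checkable
class FormatExtractor(Protocol):
    def capabilities(self) -> ExtractorCapabilities: ...
    def extract(self, stream: BinaryIO, source_reference: str) -> dict: ...


class InMemoryConnector:
    """Toy connector that serves byte blobs from a dict."""

    def __init__(self, blobs: dict):
        self._blobs = dict(blobs)

    def supported_source_types(self) -> frozenset:
        return frozenset({"memory"})

    def open(self, source_reference: str) -> BinaryIO:
        return io.BytesIO(self._blobs[source_reference])


class Utf8TextExtractor:
    """Toy extractor that decodes UTF-8 text and records provenance."""

    def capabilities(self) -> ExtractorCapabilities:
        return ExtractorCapabilities(frozenset({"text/plain"}), "full")

    def extract(self, stream: BinaryIO, source_reference: str) -> dict:
        return {
            "text": stream.read().decode("utf-8"),
            "provenance": {
                "source_reference": source_reference,
                "extractor": type(self).__name__,
            },
        }


def select_extractor(extractors, media_type: str) -> Optional[FormatExtractor]:
    """Pick the first extractor that declares support for media_type."""
    for extractor in extractors:
        if media_type in extractor.capabilities().media_types:
            return extractor
    return None
```

Because the protocols are structural, adapters do not need to inherit from anything in the domain core; they only need to provide the declared methods.
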
## I6.3 - Implement local file and directory ingestion

```task
id: KONT-WP-0006-T003
status: done
priority: high
state_hub_task_id: "d3e3d4d2-a581-4438-bee7-6fc4161d3925"
```

Create the first concrete source connector for local files and directories.

Acceptance:

- Local files can be ingested as source-referenced knowledge assets.
- Directory ingestion reports per-file success, skip, failure, and retry state.
- File path changes can be represented without changing stable asset identity
  when identity policy permits.

Implemented:

- `IngestionIdentityPolicy.SOURCE_LOCATION` remains the conservative default.
- `IngestionIdentityPolicy.CONTENT_DIGEST` preserves asset identity across file
  moves or renames when the caller opts into content identity.
- Existing assets receive a versioned `asset.ingest.update` record with new
  source references and representations.
- Re-ingesting an unchanged source is reported as a skipped child item without
  creating another asset version.

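
The difference between the two identity policies can be illustrated with a small sketch. Only the `IngestionIdentityPolicy` member names come from this workplan; the enum values, the `asset_identity_key` and `reconcile` helpers, the dict-based registry, and the use of SHA-256 are assumptions made for illustration.

```python
import hashlib
from enum import Enum


class IngestionIdentityPolicy(Enum):
    # Member names follow the workplan; the values are assumed.
    SOURCE_LOCATION = "source_location"
    CONTENT_DIGEST = "content_digest"


def asset_identity_key(policy, source_path: str, content: bytes) -> str:
    """Derive the key that decides whether two ingests are the same asset."""
    if policy is IngestionIdentityPolicy.CONTENT_DIGEST:
        return "sha256:" + hashlib.sha256(content).hexdigest()
    return "path:" + source_path


def reconcile(registry: dict, policy, source_path: str, content: bytes) -> str:
    """Return 'created', 'updated', or 'skipped' for one ingested file."""
    key = asset_identity_key(policy, source_path, content)
    digest = hashlib.sha256(content).hexdigest()
    existing = registry.get(key)
    if existing is None:
        registry[key] = {"path": source_path, "digest": digest, "version": 1}
        return "created"
    if existing["path"] == source_path and existing["digest"] == digest:
        # Unchanged source: report a skip and do not create a new version.
        return "skipped"
    # Same identity, new location or content: record a new version.
    existing.update(
        path=source_path, digest=digest, version=existing["version"] + 1
    )
    return "updated"
```

Under the content-digest policy a renamed file maps to the same key and yields a new version of the existing asset, while under the source-location policy the same bytes at a new path produce a separate asset.
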
## I6.4 - Implement text and markdown normalization via markitect-tool adapter

```task
id: KONT-WP-0006-T004
status: done
priority: high
state_hub_task_id: "63bf2f7e-705d-40ae-a160-75fc508ffb1f"
```

Normalize plain text directly and markdown through `markitect-tool` adapter
boundaries, without reimplementing markdown syntax primitives here.

Acceptance:

- Plain text produces normalized text representation and source provenance.
- Markdown extraction delegates to `markitect-tool` when available.
- Missing adapter dependencies fail with structured adapter errors.
- Parser, selector extraction, and snapshot identity behavior are covered by
  the Markitect integration contract tests.

Implemented:

- Plain text normalization produces source-grounded normalized representations.
- Markdown normalization imports and calls `markitect-tool` only inside the
  adapter boundary.
- Missing `markitect-tool` raises structured `AdapterUnavailableError`
  diagnostics.
- Adapter unit tests verify delegation and missing-dependency behavior.
- Optional contract tests verify parser, selector extraction, operations,
  snapshot identity, context packages, contracts, and schema behavior against
  the local `markitect-tool` checkout when available.

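
The missing-dependency behavior can be sketched with a lazy import inside the adapter. The `AdapterUnavailableError` name comes from this workplan, but its fields, the `adapter.unavailable` code, and the `load_adapter_dependency` helper are assumptions. The importable module name of `markitect-tool` is not stated here, so the sketch takes the module name as a parameter instead of guessing it.

```python
import importlib


class AdapterUnavailableError(RuntimeError):
    """Structured error; the fields shown here are assumed for illustration."""

    def __init__(self, adapter: str, dependency: str, hint: str):
        super().__init__(f"{adapter}: dependency '{dependency}' is unavailable")
        self.adapter = adapter
        self.dependency = dependency
        self.hint = hint

    def to_diagnostic(self) -> dict:
        # "adapter.unavailable" is a hypothetical diagnostic code.
        return {
            "code": "adapter.unavailable",
            "adapter": self.adapter,
            "dependency": self.dependency,
            "hint": self.hint,
        }


def load_adapter_dependency(adapter: str, module_name: str):
    """Import an optional dependency lazily, inside the adapter boundary."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise AdapterUnavailableError(
            adapter, module_name, "install the optional dependency"
        ) from exc
```

Deferring the import to call time keeps the domain core importable when the optional dependency is absent, and the caller receives a diagnostic it can attach to the job instead of a bare `ImportError`.
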
## I6.5 - Implement PDF office document and dataset baseline adapters

```task
id: KONT-WP-0006-T005
status: done
priority: high
state_hub_task_id: "04d7c4b0-abfd-4b14-892f-91d1c1a820cd"
```

Provide baseline ingestion adapters for PDFs, office-like documents, and
structured datasets using optional dependencies or adapter stubs with explicit
capability reporting.

Acceptance:

- Baseline formats can be represented as knowledge assets.
- Unsupported extraction depth is reported explicitly.
- CSV or table-like datasets produce structured normalized output.

Implemented:

- `CsvDatasetExtractor` supports CSV and TSV sources with structured columns,
  row counts, table metadata, and normalized dataset fields.
- `DocumentPlaceholderExtractor` supports PDF and common office media types as
  metadata-only assets with `extraction.depth_unsupported` diagnostics.
- Local file media-type detection is explicit for CSV, TSV, PDF, DOC/DOCX,
  XLS/XLSX, and PPT/PPTX.

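
As a rough illustration of the baseline behavior, the sketch below detects a media type from the file extension, extracts a table from CSV/TSV text with the standard `csv` module, and returns a metadata-only placeholder for document formats. The function names, the partial extension table, and the output dict shapes are assumptions; only the `extraction.depth_unsupported` code comes from this workplan, and the code does not reproduce `CsvDatasetExtractor` or `DocumentPlaceholderExtractor`.

```python
import csv
import io
from pathlib import PurePath

# Assumed, partial extension table; the project's detection rules may differ.
MEDIA_TYPES = {
    ".csv": "text/csv",
    ".tsv": "text/tab-separated-values",
    ".pdf": "application/pdf",
    ".docx": (
        "application/vnd.openxmlformats-officedocument"
        ".wordprocessingml.document"
    ),
}


def detect_media_type(path: str) -> str:
    """Map a file extension to a media type, case-insensitively."""
    suffix = PurePath(path).suffix.lower()
    return MEDIA_TYPES.get(suffix, "application/octet-stream")


def extract_table(text: str, media_type: str, sample_size: int = 2) -> dict:
    """Parse delimited text into columns, a row count, and sample rows."""
    delimiter = "\t" if media_type == "text/tab-separated-values" else ","
    rows = list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))
    header, data = (rows[0], rows[1:]) if rows else ([], [])
    return {
        "media_type": media_type,
        "columns": header,
        "row_count": len(data),
        "sample_rows": data[:sample_size],
    }


def extract_placeholder(path: str, media_type: str) -> dict:
    """Metadata-only result for formats without deep extraction."""
    return {
        "text": "",
        "metadata": {"file_name": PurePath(path).name, "media_type": media_type},
        "diagnostics": [
            {"code": "extraction.depth_unsupported", "media_type": media_type}
        ],
    }
```

Reporting the unsupported depth as a diagnostic, rather than returning an empty result silently, lets downstream consumers distinguish "no content" from "content not extracted".
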
## I6.6 - Extract structural elements into common normalized representation

```task
id: KONT-WP-0006-T006
status: done
priority: medium
state_hub_task_id: "7421bc87-d962-4938-9aa3-591f8489e542"
```

Represent titles, sections, headings, paragraphs, tables, links, embedded
references, fields, and confidence signals where extractors can recover them.

Acceptance:

- Normalized representation supports text, structure, tables, links, and
  extractor metadata.
- Structural output can feed search, snippets, transformations, and context
  packages.
- Extractor confidence and unsupported elements are visible.

Implemented:

- The shared `NormalizedDocument` contract already carries text, structure,
  tables, links, fields, confidence, unsupported elements, and extractor
  metadata.
- Plain text emits line and paragraph units plus URL links.
- CSV/TSV emits table schema, rows, sample rows, and URL links from cells.
- Markdown maps Markitect-provided blocks, headings, sections, table blocks,
  link tokens, and snapshot metadata without reimplementing Markdown parsing.
- PDF/office placeholders emit unsupported element structure and low-confidence
  metadata-only normalized output.

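
To make the plain-text case concrete, here is a minimal sketch of a normalized document and a plain-text normalizer that emits line units, paragraph units, and URL links. The `NormalizedDocument` name and its list of carried concepts come from this workplan; the field names, types, unit dict shapes, and the URL regular expression are assumptions, not the project's actual contract.

```python
import re
from dataclasses import dataclass, field

# Simple URL matcher for illustration; real link detection may be stricter.
URL_PATTERN = re.compile(r"https?://[^\s)>\]]+")


@dataclass
class NormalizedDocument:
    text: str
    structure: list = field(default_factory=list)
    tables: list = field(default_factory=list)
    links: list = field(default_factory=list)
    fields: dict = field(default_factory=dict)
    confidence: float = 1.0
    unsupported_elements: list = field(default_factory=list)
    extractor_metadata: dict = field(default_factory=dict)


def normalize_plain_text(text: str) -> NormalizedDocument:
    """Emit line units, paragraph units, and URL links for plain text."""
    structure = []
    for number, line in enumerate(text.splitlines(), start=1):
        structure.append({"kind": "line", "number": number, "text": line})
    # Paragraphs are runs of text separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    for index, paragraph in enumerate(paragraphs):
        structure.append({"kind": "paragraph", "index": index, "text": paragraph})
    links = [{"url": url} for url in URL_PATTERN.findall(text)]
    return NormalizedDocument(
        text=text,
        structure=structure,
        links=links,
        extractor_metadata={"extractor": "plain-text-sketch"},
    )
```

Keeping every extractor's output in one shape is what allows search, snippets, transformations, and context packages to consume text, tables, and placeholders without format-specific branches.
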
## I6.7 - Validate ingestion output quarantine failures and preserve provenance

```task
id: KONT-WP-0006-T007
status: done
priority: medium
state_hub_task_id: "07b32021-3701-437a-ae87-030bed56a25c"
```

Validate normalized content, required metadata, source provenance, permissions,
and policy constraints before ingestion completion.

Acceptance:

- Invalid output is quarantined or failed with structured diagnostics.
- Re-ingestion preserves identity, provenance, permissions, versions, and
  relationships where policy allows.
- Batch ingestion reports succeeded, failed, skipped, quarantined, and retriable
  items separately.

Implemented:

- Ingestion validates extraction output before registry writes.
- Empty normalized output, missing/mismatched source checksum provenance,
  missing extractor metadata, denied source permission context, and low
  confidence without explicit unsupported elements quarantine the file job
  without adding a trusted asset.
- Source permission context is preserved on representations and metadata
  records for accepted assets.
- Directory ingestion reports succeeded, failed, skipped, quarantined, and
  retriable item counts separately.

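
The quarantine rules listed above can be sketched as a validation pass that runs before any registry write. Everything in this block is hypothetical: the dict-based output shape, the diagnostic codes, the `LOW_CONFIDENCE_THRESHOLD` value, and the reading of "empty" as no text, no tables, and no unsupported elements are assumptions chosen for illustration.

```python
LOW_CONFIDENCE_THRESHOLD = 0.5  # assumed threshold


def validate_extraction(output: dict, source_checksum: str) -> list:
    """Return diagnostic codes; an empty list means the output is accepted."""
    diagnostics = []
    has_content = any(
        output.get(key) for key in ("text", "tables", "unsupported_elements")
    )
    if not has_content:
        diagnostics.append("output.empty")
    # A missing checksum compares unequal, so it is reported the same way.
    if output.get("provenance", {}).get("checksum") != source_checksum:
        diagnostics.append("provenance.checksum_mismatch")
    if not output.get("extractor_metadata"):
        diagnostics.append("extractor.metadata_missing")
    if output.get("permission") == "denied":
        diagnostics.append("permission.denied")
    low_confidence = output.get("confidence", 0.0) < LOW_CONFIDENCE_THRESHOLD
    if low_confidence and not output.get("unsupported_elements"):
        diagnostics.append("confidence.low_unexplained")
    return diagnostics


def ingest_batch(items, registry: dict) -> dict:
    """Validate each (source_reference, output, checksum) item before writing."""
    counts = {"succeeded": 0, "quarantined": 0}
    for source_reference, output, checksum in items:
        if validate_extraction(output, checksum):
            # Quarantined items never reach the trusted registry.
            counts["quarantined"] += 1
            continue
        registry[source_reference] = output
        counts["succeeded"] += 1
    return counts
```

Note that a low-confidence placeholder with explicit unsupported elements passes validation in this sketch, which matches the rule that only unexplained low confidence leads to quarantine.
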
## Definition Of Done

- Local file, text, markdown, PDF/document placeholder, and dataset ingestion
  scenarios are covered by tests.
- Job status and provenance are inspectable through programmatic APIs.
- Connector and extractor boundaries follow `docs/architecture-blueprint.md`.
- `python3 -m pytest` passes.