tegwick 24cb3c5b6a richer normalized structure, permission context preservation

2026-05-06 13:43:16 +02:00


id: KONT-WP-0006
type: workplan
title: Multi-Format Ingestion And Normalization
domain: markitect
repo: kontextual-engine
status: done
owner: codex
topic_slug: markitect
planning_priority: high
planning_order: 6
created: 2026-05-05
updated: 2026-05-06
state_hub_workstream_id: 270c83c0-eaed-4143-99d0-bb3fcfd23758

KONT-WP-0006: Multi-Format Ingestion And Normalization

Purpose

Implement ingestion as an observable, retryable, provenance-preserving job system that can bring heterogeneous information assets into the engine and normalize them into a common representation for retrieval, metadata, relationships, transformations, workflows, and agent context.

Requirement Coverage

Primary: FR-020 to FR-030.

Supporting: FR-001 to FR-008, FR-022 to FR-028, FR-200 to FR-202, FR-240 to FR-244.

Architecture Constraint

Implement ingestion through connector and extractor ports described in docs/architecture-blueprint.md. Format-specific parsing, local filesystem access, markitect-tool, PDF/document libraries, and dataset readers must live behind adapters, not in the domain core.

markitect-tool Boundary Remark

Markdown ingestion must use markitect-tool for Markdown parsing, frontmatter, headings, sections, selectors, includes, contract checks where needed, and snapshot identity. The engine should normalize Markitect results into its common representation and preserve source/adapter provenance rather than rebuilding Markdown syntax behavior.

Implementation Status

As of 2026-05-06, the first ingestion slice is recorded in docs/ingestion-implementation.md. It establishes ingestion job primitives, connector/extractor ports, local file ingestion, plain text normalization, Markitect Markdown adapter boundaries, directory partial-result reporting, and in-memory/SQLite job persistence. It also includes an explicit ingestion identity policy, content-digest identity for governed file move/rename reconciliation, unchanged-source skip behavior, and directory item retry/skipped reporting. CSV/TSV datasets produce structured normalized table output, and PDF/office-like files can enter the governed asset set through metadata-only placeholder extraction with explicit unsupported-depth diagnostics. Normalized structure covers text lines/paragraphs/links, CSV/TSV schemas/tables/sample rows/links, Markitect-provided Markdown blocks/headings/sections/link tokens, and document placeholder unsupported elements. The remaining work, async execution and optional deep non-text extraction adapters, is deferred to adjacent workplans, so the WP-0006 foundation slice is complete.

I6.1 - Implement ingestion job model status and retry surface

id: KONT-WP-0006-T001
status: done
priority: high
state_hub_task_id: "8e5e514a-6eef-42d9-a93c-2458b4c82753"

Define ingestion jobs that support queued, running, completed, failed, partially completed, retried, quarantined, and canceled states.

Acceptance:

Ingestion requests return job IDs and correlation IDs.
Job status exposes input, actor, source reference, output assets, failures, retry options, and partial results.
Failed ingestion does not silently enter the trusted asset set.
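
The job states and acceptance points above can be illustrated with a minimal sketch. The state names come from this task; the transition table, the field names, and the rule that only completed or partially completed jobs expose trusted assets are assumptions made for illustration, not the engine's actual model.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field
from enum import Enum


class JobState(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    PARTIALLY_COMPLETED = "partially_completed"
    RETRIED = "retried"
    QUARANTINED = "quarantined"
    CANCELED = "canceled"


# Hypothetical transition table; the real job model may allow other moves.
ALLOWED_TRANSITIONS = {
    JobState.QUEUED: {JobState.RUNNING, JobState.CANCELED},
    JobState.RUNNING: {
        JobState.COMPLETED,
        JobState.FAILED,
        JobState.PARTIALLY_COMPLETED,
        JobState.QUARANTINED,
        JobState.CANCELED,
    },
    JobState.FAILED: {JobState.RETRIED},
    JobState.PARTIALLY_COMPLETED: {JobState.RETRIED},
    JobState.RETRIED: {JobState.RUNNING, JobState.CANCELED},
}

# Assumption: only these states may contribute assets to the trusted set.
TRUSTED_STATES = {JobState.COMPLETED, JobState.PARTIALLY_COMPLETED}


@dataclass
class IngestionJob:
    source_ref: str
    actor: str
    job_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    correlation_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    state: JobState = JobState.QUEUED
    output_assets: list[str] = field(default_factory=list)
    failures: list[str] = field(default_factory=list)

    def transition(self, new_state: JobState) -> None:
        """Move to new_state, rejecting moves the table does not list."""
        if new_state not in ALLOWED_TRANSITIONS.get(self.state, set()):
            raise ValueError(f"cannot move from {self.state.value} to {new_state.value}")
        self.state = new_state

    def trusted_assets(self) -> list[str]:
        """Return output assets only when the job ended in a trusted state."""
        return list(self.output_assets) if self.state in TRUSTED_STATES else []
```

A job that fails after producing an intermediate asset returns an empty list from `trusted_assets()`, which is one way to keep failed ingestion out of the trusted asset set.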

I6.2 - Implement connector and extractor contracts

id: KONT-WP-0006-T002
status: done
priority: high
state_hub_task_id: "3eafdab5-478d-49d9-a17f-3cd7c8847cb1"

Define source connector and format extractor protocols that can provide source references, metadata, permission context, content streams, and normalized outputs.

Acceptance:

Connectors can describe capabilities and supported source types.
Extractors can describe supported media types and extraction depth.
External extraction results can be accepted with provenance.
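
One possible shape for these contracts is sketched below with `typing.Protocol`. All class, method, and field names here (`SourceConnector`, `FormatExtractor`, `capabilities`, `depth`, and so on) are hypothetical and are not taken from docs/architecture-blueprint.md; the sketch only shows how capability reporting can drive extractor selection.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class ConnectorCapabilities:
    source_types: tuple[str, ...]
    supports_directories: bool = False


@dataclass(frozen=True)
class ExtractorCapabilities:
    media_types: tuple[str, ...]
    depth: str  # assumed vocabulary, e.g. "full" or "metadata_only"


@runtime_checkable
class SourceConnector(Protocol):
    def capabilities(self) -> ConnectorCapabilities: ...
    def read(self, source_ref: str) -> bytes: ...


@runtime_checkable
class FormatExtractor(Protocol):
    def capabilities(self) -> ExtractorCapabilities: ...
    def extract(self, content: bytes, media_type: str) -> dict: ...


class PlainTextExtractor:
    """Illustrative extractor; satisfies FormatExtractor structurally."""

    def capabilities(self) -> ExtractorCapabilities:
        return ExtractorCapabilities(media_types=("text/plain",), depth="full")

    def extract(self, content: bytes, media_type: str) -> dict:
        return {
            "text": content.decode("utf-8"),
            "media_type": media_type,
            "extractor": "plain-text-sketch",  # provenance of the result
        }


def select_extractor(extractors, media_type: str):
    """Return the first extractor that declares support for media_type."""
    for extractor in extractors:
        if media_type in extractor.capabilities().media_types:
            return extractor
    return None
```

Structural typing keeps format-specific classes free of any import from the domain core, which fits the adapter boundary described in the architecture constraint.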

I6.3 - Implement local file and directory ingestion

id: KONT-WP-0006-T003
status: done
priority: high
state_hub_task_id: "d3e3d4d2-a581-4438-bee7-6fc4161d3925"

Create the first concrete source connector for local files and directories.

Acceptance:

Local files can be ingested as source-referenced knowledge assets.
Directory ingestion reports per-file success, skip, failure, and retry state.
File path changes can be represented without changing stable asset identity when identity policy permits.

Implemented:

IngestionIdentityPolicy.SOURCE_LOCATION remains the conservative default.
IngestionIdentityPolicy.CONTENT_DIGEST preserves asset identity across file moves or renames when the caller opts into content identity.
Existing assets receive a versioned asset.ingest.update record with new source references and representations.
Re-ingesting an unchanged source is reported as a skipped child item without creating another asset version.
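
The difference between the two identity policies can be sketched as follows. Only the enum name and its two members come from this workplan; the `identity_key` helper, the in-memory registry, and the `"created"`/`"updated"`/`"skipped"` return values are assumptions made for illustration.

```python
import hashlib
from enum import Enum


class IngestionIdentityPolicy(Enum):
    SOURCE_LOCATION = "source_location"
    CONTENT_DIGEST = "content_digest"


def identity_key(policy: IngestionIdentityPolicy, path: str, content: bytes) -> str:
    """Derive the stable asset key under the chosen policy."""
    if policy is IngestionIdentityPolicy.CONTENT_DIGEST:
        return "sha256:" + hashlib.sha256(content).hexdigest()
    return "path:" + path


class AssetRegistrySketch:
    """Hypothetical in-memory registry, not the engine's real store."""

    def __init__(self, policy=IngestionIdentityPolicy.SOURCE_LOCATION):
        self.policy = policy
        self.assets = {}

    def ingest(self, path: str, content: bytes) -> str:
        key = identity_key(self.policy, path, content)
        digest = hashlib.sha256(content).hexdigest()
        record = self.assets.get(key)
        if record is None:
            self.assets[key] = {"path": path, "digest": digest, "versions": 1}
            return "created"
        if record["path"] == path and record["digest"] == digest:
            return "skipped"  # unchanged source: no new version
        # Same identity, new location or new content: add a version.
        record.update(path=path, digest=digest)
        record["versions"] += 1
        return "updated"
```

Under `CONTENT_DIGEST`, ingesting the same bytes from a renamed path updates the existing asset; under `SOURCE_LOCATION`, the renamed path is treated as a new asset.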

I6.4 - Implement text and markdown normalization via markitect-tool adapter

id: KONT-WP-0006-T004
status: done
priority: high
state_hub_task_id: "63bf2f7e-705d-40ae-a160-75fc508ffb1f"

Normalize plain text directly and Markdown through markitect-tool adapter boundaries, without reimplementing Markdown syntax primitives here.

Acceptance:

Plain text produces normalized text representation and source provenance.
Markdown extraction delegates to markitect-tool when available.
Missing adapter dependencies fail with structured adapter errors.
Parser, selector extraction, and snapshot identity behavior are covered by the Markitect integration contract tests.

Implemented:

Plain text normalization produces source-grounded normalized representations.
Markdown normalization imports and calls markitect-tool only inside the adapter boundary.
Missing markitect-tool raises structured AdapterUnavailableError diagnostics.
Adapter unit tests verify delegation and missing-dependency behavior.
Optional contract tests verify parser, selector extraction, operations, snapshot identity, context packages, contracts, and schema behavior against the local markitect-tool checkout when available.
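
A minimal sketch of the missing-dependency behavior follows. `AdapterUnavailableError` is named in this workplan, but its constructor, the `diagnostics()` shape, and the `load_adapter_dependency` helper are assumptions. The importable module name of markitect-tool is not stated here, so the helper takes the module name as a parameter and the sketch calls no markitect-tool API.

```python
import importlib


class AdapterUnavailableError(RuntimeError):
    """Raised when an adapter's optional dependency cannot be imported."""

    def __init__(self, adapter: str, dependency: str):
        super().__init__(
            f"adapter {adapter!r} requires missing dependency {dependency!r}"
        )
        self.adapter = adapter
        self.dependency = dependency

    def diagnostics(self) -> dict:
        # Hypothetical diagnostic shape for job status reporting.
        return {
            "code": "adapter.unavailable",
            "adapter": self.adapter,
            "dependency": self.dependency,
        }


def load_adapter_dependency(adapter: str, module_name: str):
    """Import module_name lazily so the import stays inside the adapter."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise AdapterUnavailableError(adapter, module_name) from exc
```

Deferring the import to call time means the domain core can be imported and tested without the optional dependency installed, and a missing dependency surfaces as a structured error rather than a bare `ImportError`.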

I6.5 - Implement PDF office document and dataset baseline adapters

id: KONT-WP-0006-T005
status: done
priority: high
state_hub_task_id: "04d7c4b0-abfd-4b14-892f-91d1c1a820cd"

Provide baseline ingestion adapters for PDFs, office-like documents, and structured datasets using optional dependencies or adapter stubs with explicit capability reporting.

Acceptance:

Baseline formats can be represented as knowledge assets.
Unsupported extraction depth is reported explicitly.
CSV or table-like datasets produce structured normalized output.

Implemented:

CsvDatasetExtractor supports CSV and TSV sources with structured columns, row counts, table metadata, and normalized dataset fields.
DocumentPlaceholderExtractor supports PDF and common office media types as metadata-only assets with extraction.depth_unsupported diagnostics.
Local file media-type detection is explicit for CSV, TSV, PDF, DOC/DOCX, XLS/XLSX, and PPT/PPTX.
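
The baseline behavior can be approximated with the standard library alone. The function names, the output dictionary keys, and the treatment of the first row as the header are assumptions; only the supported extensions and the `extraction.depth_unsupported` diagnostic are taken from the list above.

```python
import csv
import io
from pathlib import PurePath

# Explicit extension-to-media-type table for the formats listed above.
MEDIA_TYPES = {
    ".csv": "text/csv",
    ".tsv": "text/tab-separated-values",
    ".pdf": "application/pdf",
    ".doc": "application/msword",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".xls": "application/vnd.ms-excel",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".ppt": "application/vnd.ms-powerpoint",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def detect_media_type(path: str):
    """Return the media type for a known extension, else None."""
    return MEDIA_TYPES.get(PurePath(path).suffix.lower())


def extract_table(text: str, media_type: str, sample_size: int = 5) -> dict:
    """Parse CSV/TSV text; assumes the first row is the header."""
    delimiter = "\t" if media_type == "text/tab-separated-values" else ","
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return {"columns": [], "row_count": 0, "sample_rows": []}
    header, data = rows[0], rows[1:]
    return {
        "columns": header,
        "row_count": len(data),
        "sample_rows": data[:sample_size],
    }


def extract_placeholder(media_type: str) -> dict:
    """Metadata-only result for formats without a deep extractor."""
    return {
        "media_type": media_type,
        "text": "",
        "depth": "metadata_only",
        "diagnostics": ["extraction.depth_unsupported"],
    }
```

Returning an explicit diagnostic from the placeholder keeps the unsupported depth visible to callers instead of presenting an empty document as a successful extraction.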

I6.6 - Extract structural elements into common normalized representation

id: KONT-WP-0006-T006
status: done
priority: medium
state_hub_task_id: "7421bc87-d962-4938-9aa3-591f8489e542"

Represent titles, sections, headings, paragraphs, tables, links, embedded references, fields, and confidence signals where extractors can recover them.

Acceptance:

Normalized representation supports text, structure, tables, links, and extractor metadata.
Structural output can feed search, snippets, transformations, and context packages.
Extractor confidence and unsupported elements are visible.

Implemented:

The shared NormalizedDocument contract already carries text, structure, tables, links, fields, confidence, unsupported elements, and extractor metadata.
Plain text emits line and paragraph units plus URL links.
CSV/TSV emits table schema, rows, sample rows, and URL links from cells.
Markdown maps Markitect-provided blocks, headings, sections, table blocks, link tokens, and snapshot metadata without reimplementing Markdown parsing.
PDF/office placeholders emit unsupported element structure and low-confidence metadata-only normalized output.
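
To make the contract concrete, here is a hedged sketch of what a `NormalizedDocument` and the plain text path might look like. The contract name and the kinds of data it carries come from the list above; the exact field names, the structure unit dictionaries, and the URL pattern are assumptions.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field

# Simplified URL pattern for illustration; real link detection may differ.
URL_RE = re.compile(r"https?://[^\s<>]+")


@dataclass
class NormalizedDocument:
    text: str
    structure: list[dict] = field(default_factory=list)
    tables: list[dict] = field(default_factory=list)
    links: list[str] = field(default_factory=list)
    fields: dict = field(default_factory=dict)
    confidence: float = 1.0
    unsupported_elements: list[str] = field(default_factory=list)
    extractor_metadata: dict = field(default_factory=dict)


def normalize_plain_text(text: str) -> NormalizedDocument:
    """Emit line units, paragraph units, and URL links for plain text."""
    structure = [
        {"kind": "line", "number": number, "text": line}
        for number, line in enumerate(text.splitlines(), start=1)
    ]
    # Paragraphs are runs of text separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    structure.extend(
        {"kind": "paragraph", "index": index, "text": paragraph}
        for index, paragraph in enumerate(paragraphs)
    )
    return NormalizedDocument(
        text=text,
        structure=structure,
        links=URL_RE.findall(text),
        extractor_metadata={"extractor": "plain-text-sketch"},
    )
```

Because every extractor fills the same container, downstream consumers such as search or context packaging can read `structure`, `tables`, and `links` without knowing which format produced them.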

I6.7 - Validate ingestion output quarantine failures and preserve provenance

id: KONT-WP-0006-T007
status: done
priority: medium
state_hub_task_id: "07b32021-3701-437a-ae87-030bed56a25c"

Validate normalized content, required metadata, source provenance, permissions, and policy constraints before ingestion completion.

Acceptance:

Invalid output is quarantined or failed with structured diagnostics.
Re-ingestion preserves identity, provenance, permissions, versions, and relationships where policy allows.
Batch ingestion reports succeeded, failed, skipped, quarantined, and retriable items separately.

Implemented:

Ingestion validates extraction output before registry writes.
Empty normalized output, missing or mismatched source checksum provenance, missing extractor metadata, a denied source permission context, or low confidence without explicit unsupported elements each quarantine the file job without adding a trusted asset.
Source permission context is preserved on representations and metadata records for accepted assets.
Directory ingestion reports succeeded, failed, skipped, quarantined, and retriable item counts separately.
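
The validation gate described above might be sketched as below. The five checks mirror the quarantine conditions listed in this task; the dictionary keys, the diagnostic codes, the 0.5 confidence threshold, and the definition of "empty" as no text, structure, or tables are all assumptions made for illustration.

```python
from __future__ import annotations

import hashlib
from collections import Counter

LOW_CONFIDENCE = 0.5  # hypothetical threshold


def validate_extraction(output: dict, source_bytes: bytes) -> list[str]:
    """Return diagnostic codes; an empty list means the output is acceptable."""
    problems = []
    if not (output.get("text") or output.get("structure") or output.get("tables")):
        problems.append("output.empty")
    expected = hashlib.sha256(source_bytes).hexdigest()
    if output.get("source_checksum") != expected:  # also covers a missing value
        problems.append("provenance.checksum_mismatch")
    if not output.get("extractor_metadata"):
        problems.append("extractor.metadata_missing")
    if output.get("permission_context", {}).get("decision") == "deny":
        problems.append("permission.denied")
    low = output.get("confidence", 0.0) < LOW_CONFIDENCE
    if low and not output.get("unsupported_elements"):
        problems.append("confidence.low_unexplained")
    return problems


def decide(output: dict, source_bytes: bytes) -> str:
    """Quarantine before any registry write when validation finds problems."""
    return "quarantined" if validate_extraction(output, source_bytes) else "accepted"


def summarize(outcomes: list[str]) -> dict:
    """Report each batch outcome category as a separate count."""
    counts = Counter(outcomes)
    categories = ("succeeded", "failed", "skipped", "quarantined", "retriable")
    return {name: counts.get(name, 0) for name in categories}
```

Under these assumptions a metadata-only placeholder passes validation because its low confidence is accompanied by explicit unsupported elements, while an empty result with no provenance is quarantined.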

Definition Of Done

Local file, text, markdown, PDF/document placeholder, and dataset ingestion scenarios are covered by tests.
Job status and provenance are inspectable through programmatic APIs.
Connector and extractor boundaries follow docs/architecture-blueprint.md.
python3 -m pytest passes.
