richer normalized structure, permission context preservation

2026-05-06 13:43:16 +02:00
parent a4a4759ac4
commit 24cb3c5b6a
10 changed files with 636 additions and 15 deletions


@@ -4,7 +4,7 @@ type: workplan
title: "Multi-Format Ingestion And Normalization"
domain: markitect
repo: kontextual-engine
-status: active
+status: done
owner: codex
topic_slug: markitect
planning_priority: high
@@ -56,9 +56,12 @@ identity policy, content-digest identity for governed file move/rename
reconciliation, unchanged-source skip behavior, and directory item retry/skipped
reporting. CSV/TSV datasets now produce structured normalized table output, and
PDF/office-like files can enter the governed asset set through metadata-only
-placeholder extraction with explicit unsupported-depth diagnostics. Remaining
-work is focused on async execution, richer structural extraction, quarantine
-policy checks, and optional deep non-text extraction adapters.
+placeholder extraction with explicit unsupported-depth diagnostics. Normalized
+structure now covers text lines/paragraphs/links, CSV/TSV schemas/tables/sample
+rows/links, Markitect-provided Markdown blocks/headings/sections/link tokens,
+and document placeholder unsupported elements. Remaining work is focused on
+async execution and optional deep non-text extraction adapters, which are
+deferred to adjacent workplans. This WP-0006 foundation slice is complete.
## I6.1 - Implement ingestion job model status and retry surface
@@ -190,7 +193,7 @@ Implemented:
```task
id: KONT-WP-0006-T006
-status: todo
+status: done
priority: medium
state_hub_task_id: "7421bc87-d962-4938-9aa3-591f8489e542"
```
@@ -206,11 +209,23 @@ Acceptance:
packages.
- Extractor confidence and unsupported elements are visible.
+Implemented:
+- The shared `NormalizedDocument` contract already carries text, structure,
+  tables, links, fields, confidence, unsupported elements, and extractor
+  metadata.
+- Plain text emits line and paragraph units plus URL links.
+- CSV/TSV emits table schema, rows, sample rows, and URL links from cells.
+- Markdown maps Markitect-provided blocks, headings, sections, table blocks,
+  link tokens, and snapshot metadata without reimplementing Markdown parsing.
+- PDF/office placeholders emit unsupported element structure and low-confidence
+  metadata-only normalized output.
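
For readers without the repository at hand, the contract and the plain-text case described above might look roughly like the sketch below. This is an illustration only: the field names are taken from the bullet list, while the types, the `normalize_plain_text` helper, the unit dictionaries, and the URL pattern are assumptions, not the actual kontextual-engine code.

```python
import re
from dataclasses import dataclass, field
from typing import Any, Dict, List


# Hypothetical shape: field names mirror the bullet list, types are guesses.
@dataclass
class NormalizedDocument:
    text: str
    structure: List[Dict[str, Any]] = field(default_factory=list)
    tables: List[Dict[str, Any]] = field(default_factory=list)
    links: List[str] = field(default_factory=list)
    fields: Dict[str, Any] = field(default_factory=dict)
    confidence: float = 1.0
    unsupported_elements: List[Dict[str, Any]] = field(default_factory=list)
    extractor: Dict[str, str] = field(default_factory=dict)


# Assumed URL pattern; the real extractor may use a different rule.
URL_RE = re.compile(r"https?://[^\s<>)\]]+")


def normalize_plain_text(text: str) -> NormalizedDocument:
    """Emit line units, paragraph units, and de-duplicated URL links."""
    structure: List[Dict[str, Any]] = []
    for number, line in enumerate(text.splitlines(), start=1):
        structure.append({"kind": "line", "index": number, "text": line})
    # Paragraphs are runs of text separated by at least one blank line.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    for number, paragraph in enumerate(paragraphs, start=1):
        structure.append({"kind": "paragraph", "index": number, "text": paragraph})
    # dict.fromkeys keeps first-seen order while dropping repeated URLs.
    links = list(dict.fromkeys(URL_RE.findall(text)))
    return NormalizedDocument(
        text=text,
        structure=structure,
        links=links,
        extractor={"name": "plain-text-sketch", "version": "0"},  # placeholder values
    )
```

Keeping lines and paragraphs as flat, typed units in one `structure` list is just one possible layout; the real contract may nest or separate them differently.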
## I6.7 - Validate ingestion output quarantine failures and preserve provenance
```task
id: KONT-WP-0006-T007
-status: todo
+status: done
priority: medium
state_hub_task_id: "07b32021-3701-437a-ae87-030bed56a25c"
```
@@ -226,6 +241,18 @@ Acceptance:
- Batch ingestion reports succeeded, failed, skipped, quarantined, and retriable
items separately.
+Implemented:
+- Ingestion validates extraction output before registry writes.
+- Empty normalized output, missing/mismatched source checksum provenance,
+  missing extractor metadata, denied source permission context, and low
+  confidence without explicit unsupported elements quarantine the file job
+  without adding a trusted asset.
+- Source permission context is preserved on representations and metadata
+  records for accepted assets.
+- Directory ingestion reports succeeded, failed, skipped, quarantined, and
+  retriable item counts separately.
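
The quarantine rules listed above can be read as a pure check that runs before any registry write. The sketch below is one way to express that. The `ExtractionResult` type, the reason strings, the `decision` key in the permission context, and the 0.5 confidence threshold are all assumptions made for illustration, not the shipped implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

LOW_CONFIDENCE = 0.5  # assumed threshold; the real policy value is not stated


# Hypothetical carrier for what an extractor hands to validation.
@dataclass
class ExtractionResult:
    text: str
    source_checksum: Optional[str]
    confidence: float
    structure: List[Dict[str, Any]] = field(default_factory=list)
    extractor: Dict[str, str] = field(default_factory=dict)
    unsupported_elements: List[Dict[str, Any]] = field(default_factory=list)
    permission_context: Dict[str, Any] = field(default_factory=dict)


def quarantine_reasons(result: ExtractionResult, expected_checksum: str) -> List[str]:
    """Return every rule the result violates; an empty list means acceptable."""
    reasons: List[str] = []
    if not result.text.strip() and not result.structure:
        reasons.append("empty_normalized_output")
    if result.source_checksum is None:
        reasons.append("missing_source_checksum")
    elif result.source_checksum != expected_checksum:
        reasons.append("mismatched_source_checksum")
    if not result.extractor:
        reasons.append("missing_extractor_metadata")
    # Assumed convention: the permission context carries a "decision" key.
    if result.permission_context.get("decision") == "deny":
        reasons.append("denied_source_permission")
    if result.confidence < LOW_CONFIDENCE and not result.unsupported_elements:
        reasons.append("low_confidence_without_unsupported_elements")
    return reasons


def validate(result: ExtractionResult, expected_checksum: str) -> str:
    """Map the reason list to a file job outcome."""
    return "quarantined" if quarantine_reasons(result, expected_checksum) else "accepted"
```

Collecting all reasons instead of stopping at the first failure keeps the quarantine record useful for diagnosis; whether the real validator does the same is not stated in the workplan.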
## Definition Of Done
- Local file, text, markdown, PDF/document placeholder, and dataset ingestion