richer normalized structure, permission context preservation
@@ -4,7 +4,7 @@ type: workplan
 title: "Multi-Format Ingestion And Normalization"
 domain: markitect
 repo: kontextual-engine
-status: active
+status: done
 owner: codex
 topic_slug: markitect
 planning_priority: high
@@ -56,9 +56,12 @@ identity policy, content-digest identity for governed file move/rename
 reconciliation, unchanged-source skip behavior, and directory item retry/skipped
 reporting. CSV/TSV datasets now produce structured normalized table output, and
 PDF/office-like files can enter the governed asset set through metadata-only
-placeholder extraction with explicit unsupported-depth diagnostics. Remaining
-work is focused on async execution, richer structural extraction, quarantine
-policy checks, and optional deep non-text extraction adapters.
+placeholder extraction with explicit unsupported-depth diagnostics. Normalized
+structure now covers text lines/paragraphs/links, CSV/TSV schemas/tables/sample
+rows/links, Markitect-provided Markdown blocks/headings/sections/link tokens,
+and document placeholder unsupported elements. Remaining work is focused on
+async execution and optional deep non-text extraction adapters, which are
+deferred to adjacent workplans. This WP-0006 foundation slice is complete.

 ## I6.1 - Implement ingestion job model status and retry surface

@@ -190,7 +193,7 @@ Implemented:

 ```task
 id: KONT-WP-0006-T006
-status: todo
+status: done
 priority: medium
 state_hub_task_id: "7421bc87-d962-4938-9aa3-591f8489e542"
 ```
@@ -206,11 +209,23 @@ Acceptance:
   packages.
 - Extractor confidence and unsupported elements are visible.

+Implemented:
+
+- The shared `NormalizedDocument` contract already carries text, structure,
+  tables, links, fields, confidence, unsupported elements, and extractor
+  metadata.
+- Plain text emits line and paragraph units plus URL links.
+- CSV/TSV emits table schema, rows, sample rows, and URL links from cells.
+- Markdown maps Markitect-provided blocks, headings, sections, table blocks,
+  link tokens, and snapshot metadata without reimplementing Markdown parsing.
+- PDF/office placeholders emit unsupported element structure and low-confidence
+  metadata-only normalized output.
+
 ## I6.7 - Validate ingestion output quarantine failures and preserve provenance

 ```task
 id: KONT-WP-0006-T007
-status: todo
+status: done
 priority: medium
 state_hub_task_id: "07b32021-3701-437a-ae87-030bed56a25c"
 ```
@@ -226,6 +241,18 @@ Acceptance:
 - Batch ingestion reports succeeded, failed, skipped, quarantined, and retriable
   items separately.

+Implemented:
+
+- Ingestion validates extraction output before registry writes.
+- Empty normalized output, missing/mismatched source checksum provenance,
+  missing extractor metadata, denied source permission context, and low
+  confidence without explicit unsupported elements quarantine the file job
+  without adding a trusted asset.
+- Source permission context is preserved on representations and metadata
+  records for accepted assets.
+- Directory ingestion reports succeeded, failed, skipped, quarantined, and
+  retriable item counts separately.
+
 ## Definition Of Done

 - Local file, text, markdown, PDF/document placeholder, and dataset ingestion
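
The quarantine rules added under I6.7 in this diff can be sketched as a single pre-write check. This is a minimal illustration only, not the repository's implementation: `ExtractionOutput`, `quarantine_reasons`, the field names, the reason strings, and the `LOW_CONFIDENCE` threshold are all hypothetical, since the diff does not show the real `NormalizedDocument` fields or any numeric confidence cutoff.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical cutoff; the workplan does not state a numeric value.
LOW_CONFIDENCE = 0.5


@dataclass
class ExtractionOutput:
    """Illustrative stand-in for an extractor result (not the real contract)."""
    text: str = ""
    structure: list = field(default_factory=list)
    source_checksum: Optional[str] = None
    extractor: Optional[dict] = None
    permission_allowed: bool = True
    confidence: float = 1.0
    unsupported_elements: list = field(default_factory=list)


def quarantine_reasons(out: ExtractionOutput, expected_checksum: str) -> list[str]:
    """Return every reason the file job should be quarantined (empty list = accept)."""
    reasons = []
    # Assumption: "empty" means neither text nor structural units were produced.
    if not out.text.strip() and not out.structure:
        reasons.append("empty_normalized_output")
    if out.source_checksum is None:
        reasons.append("missing_checksum_provenance")
    elif out.source_checksum != expected_checksum:
        reasons.append("mismatched_checksum_provenance")
    if not out.extractor:
        reasons.append("missing_extractor_metadata")
    if not out.permission_allowed:
        reasons.append("denied_permission_context")
    # Low confidence is tolerated only when the extractor says what it skipped.
    if out.confidence < LOW_CONFIDENCE and not out.unsupported_elements:
        reasons.append("low_confidence_without_unsupported_elements")
    return reasons
```

Under these assumptions, a metadata-only PDF placeholder that reports low confidence together with explicit unsupported elements passes the check, while the same low-confidence output with no unsupported elements is quarantined. Collecting all reasons rather than stopping at the first is a design choice of the sketch; it keeps the diagnostics complete for a quarantined job.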