richer normalized structure, permission context preservation

2026-05-06 13:43:16 +02:00
parent a4a4759ac4
commit 24cb3c5b6a
10 changed files with 636 additions and 15 deletions
--- a/docs/ingestion-implementation.md
+++ b/docs/ingestion-implementation.md
@@ -38,13 +38,21 @@ The new `AssetIngestionService` is separate from the older artifact-era
  by default and opt-in content-digest identity for governed file move/rename
  reconciliation.
 - Plain text extractor producing a normalized engine representation.
+- Plain text structural output includes lines, paragraphs, link extraction,
+  confidence, and extractor metadata.
 - CSV/TSV dataset extractor producing structured normalized table output with
  columns, row counts, and table metadata.
+- CSV/TSV structural output includes table schemas, sample rows, link
+  extraction from cell values, confidence, and extractor metadata.
 - PDF and office document placeholder extractor that represents binary
  documents as governed assets while reporting metadata-only extraction depth.
+- PDF and office placeholder output exposes unsupported elements, unsupported
+  counts, confidence, and extractor metadata.
 - Markitect markdown extractor adapter boundary that delegates markdown parsing,
  headings, sections, frontmatter, and snapshot identity to `markitect-tool`
  when available.
+- Markdown normalized structure preserves Markitect-provided blocks, headings,
+  sections, snapshot metadata, table blocks, and link tokens where available.
 - Missing `markitect-tool` dependency fails through structured
  `AdapterUnavailableError` diagnostics instead of falling back to local
  Markdown parsing.
@@ -62,6 +70,13 @@ The new `AssetIngestionService` is separate from the older artifact-era
  source/normalized representations instead of creating a second asset.
 - Unchanged source re-ingestion can be skipped without creating a new asset
  version.
+- Ingestion validation runs after extraction and before registry writes.
+- Invalid normalized output, missing source provenance, checksum mismatch, low
+  confidence without explicit unsupported elements, missing extractor
+  provenance, or denied source permission context quarantine the job without
+  creating a trusted asset.
+- Source permission context is preserved on source/normalized representation
+  metadata and as a metadata record when present.
 - In-memory and SQLite job persistence.

 ## Current SQLite Additions
@@ -82,9 +97,9 @@ document classes part of the engine domain model.

 - Asynchronous job runner and queue dispatch.
 - Deep normalized structure for tables, links, embedded references, and fields
-  beyond extractor-provided metadata and the CSV/TSV baseline.
+  beyond extractor-provided metadata and current text/Markdown/CSV baselines.
 - Optional deep PDF and office document extraction adapters.
-- Quarantine policy checks beyond unsupported/failed extraction paths.
+- Enterprise policy adapter integration for ingestion-time policy decisions.

 ## Test Coverage

@@ -100,8 +115,14 @@ document classes part of the engine domain model.
 - unchanged source re-ingestion skip behavior,
 - Markitect markdown adapter delegation and missing-dependency behavior,
 - CSV dataset structured normalization,
+- normalized structure coverage for text, Markdown, CSV, and document
+  placeholders,
 - PDF and office placeholder ingestion with explicit unsupported-depth
  diagnostics,
+- ingestion validation/quarantine before registry writes,
+- permission-context preservation on trusted ingested assets,
+- directory reporting for succeeded, skipped, failed, quarantined, and
+  retriable items,
 - optional Markitect integration contract tests for parser, selector,
  operation, snapshot, context package, contract, and schema behavior,
 - SQLite reload preserving ingestion jobs and ingested asset state.
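
The validation bullets added in this commit (validation runs after extraction and before registry writes; any listed failure quarantines the job without creating a trusted asset; permission context is preserved when present) could be sketched roughly as follows. This is an illustrative sketch only, not the actual `AssetIngestionService` code: `ExtractionResult`, `validate`, `ingest`, the reason strings, the `denied` key, and the `LOW_CONFIDENCE_THRESHOLD` value are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Optional

# Assumed cutoff for "low confidence"; the real threshold is not stated in the diff.
LOW_CONFIDENCE_THRESHOLD = 0.5


@dataclass
class ExtractionResult:
    """Hypothetical container for what an extractor hands to validation."""
    normalized: Optional[dict]
    source_provenance: Optional[dict]
    extractor_provenance: Optional[dict]
    expected_checksum: str
    actual_checksum: str
    confidence: float
    unsupported_elements: list = field(default_factory=list)
    permission_context: Optional[dict] = None


def validate(result: ExtractionResult) -> list:
    """Return quarantine reasons; an empty list means the result may be registered."""
    reasons = []
    if not isinstance(result.normalized, dict) or not result.normalized:
        reasons.append("invalid_normalized_output")
    if not result.source_provenance:
        reasons.append("missing_source_provenance")
    if result.expected_checksum != result.actual_checksum:
        reasons.append("checksum_mismatch")
    # Low confidence is tolerated only when the extractor explains it
    # by listing explicit unsupported elements.
    if result.confidence < LOW_CONFIDENCE_THRESHOLD and not result.unsupported_elements:
        reasons.append("low_confidence_without_unsupported_elements")
    if not result.extractor_provenance:
        reasons.append("missing_extractor_provenance")
    if result.permission_context and result.permission_context.get("denied"):
        reasons.append("denied_source_permission_context")
    return reasons


def ingest(result: ExtractionResult, registry: list) -> dict:
    """Validate first; write to the registry only when nothing failed."""
    reasons = validate(result)
    if reasons:
        # Quarantined jobs never reach the registry.
        return {"status": "quarantined", "reasons": reasons}
    record = {"normalized": result.normalized, "metadata": {}}
    if result.permission_context is not None:
        # Preserve the source permission context on the stored metadata.
        record["metadata"]["permission_context"] = result.permission_context
    registry.append(record)
    return {"status": "succeeded", "reasons": []}
```

Collecting every failure reason rather than stopping at the first one keeps the quarantine diagnostics complete, which seems in line with the structured-diagnostics approach the document describes elsewhere. Whether the real service does this is not stated in the diff.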