richer normalized structure, permission context preservation

2026-05-06 13:43:16 +02:00
parent a4a4759ac4
commit 24cb3c5b6a
10 changed files with 636 additions and 15 deletions


@@ -38,13 +38,21 @@ The new `AssetIngestionService` is separate from the older artifact-era
by default and opt-in content-digest identity for governed file move/rename
reconciliation.
- Plain text extractor producing a normalized engine representation.
- Plain text structural output includes lines, paragraphs, link extraction,
confidence, and extractor metadata.
- CSV/TSV dataset extractor producing structured normalized table output with
columns, row counts, and table metadata.
- CSV/TSV structural output includes table schemas, sample rows, link
extraction from cell values, confidence, and extractor metadata.
- PDF and office document placeholder extractor that represents binary
documents as governed assets while reporting metadata-only extraction depth.
- PDF and office placeholder output exposes unsupported elements, unsupported
counts, confidence, and extractor metadata.
- Markitect Markdown extractor adapter boundary that delegates Markdown parsing,
headings, sections, frontmatter, and snapshot identity to `markitect-tool`
when available.
- Markdown normalized structure preserves Markitect-provided blocks, headings,
sections, snapshot metadata, table blocks, and link tokens where available.
- A missing `markitect-tool` dependency fails with a structured
`AdapterUnavailableError` diagnostic instead of falling back to local
Markdown parsing.
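
The missing-dependency behavior in the list above can be sketched as follows. This is a minimal illustration, not the engine's actual code: the importable module name `markitect_tool`, the `load_markdown_backend` helper, the `to_diagnostic` method, and the diagnostic field names are all assumptions. Only the `AdapterUnavailableError` name comes from the text.

```python
import importlib


class AdapterUnavailableError(RuntimeError):
    """Raised when an optional adapter dependency cannot be imported."""

    def __init__(self, adapter: str, dependency: str, hint: str = ""):
        super().__init__(f"{adapter} requires '{dependency}', which is not installed")
        self.adapter = adapter
        self.dependency = dependency
        self.hint = hint

    def to_diagnostic(self) -> dict:
        # Hypothetical diagnostic shape; the real field names may differ.
        return {
            "code": "adapter_unavailable",
            "adapter": self.adapter,
            "dependency": self.dependency,
            "hint": self.hint,
        }


def load_markdown_backend(module_name: str = "markitect_tool"):
    """Import the Markdown backend or fail loudly; never parse locally."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # No fallback parser: surface a structured error to the caller.
        raise AdapterUnavailableError(
            adapter="markitect-markdown",
            dependency=module_name,
            hint="install the optional Markdown adapter dependency",
        ) from exc
```

Raising instead of falling back keeps a single source of truth for Markdown structure, so a missing dependency cannot silently produce a differently shaped normalized output.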
@@ -62,6 +70,13 @@ The new `AssetIngestionService` is separate from the older artifact-era
source/normalized representations instead of creating a second asset.
- Unchanged source re-ingestion can be skipped without creating a new asset
version.
- Ingestion validation runs after extraction and before registry writes.
- Invalid normalized output, missing source provenance, a checksum mismatch, low
confidence without explicit unsupported elements, missing extractor
provenance, or a denied source permission context quarantines the job without
creating a trusted asset.
- When present, source permission context is preserved on both the source and
normalized representation metadata and is also stored as a metadata record.
- In-memory and SQLite job persistence.
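
The validation step described in the list above (after extraction, before registry writes) might look roughly like the sketch below. Everything here is hypothetical scaffolding: the `Extraction` dataclass, the `LOW_CONFIDENCE` threshold of 0.5, the `"decision": "deny"` convention for permission context, and the reason strings are assumptions chosen to mirror the listed quarantine conditions, not the engine's real types.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional

LOW_CONFIDENCE = 0.5  # assumed threshold, not taken from the source


@dataclass
class Extraction:
    source_bytes: bytes
    expected_sha256: str
    normalized: Optional[dict]
    confidence: float
    unsupported_elements: List[str] = field(default_factory=list)
    source_provenance: Optional[dict] = None
    extractor_provenance: Optional[dict] = None
    permission_context: Optional[dict] = None


def validation_failures(x: Extraction) -> List[str]:
    """Return one reason string per failed check; empty means trusted."""
    reasons = []
    if not isinstance(x.normalized, dict) or not x.normalized:
        reasons.append("invalid_normalized_output")
    if not x.source_provenance:
        reasons.append("missing_source_provenance")
    if hashlib.sha256(x.source_bytes).hexdigest() != x.expected_sha256:
        reasons.append("checksum_mismatch")
    if x.confidence < LOW_CONFIDENCE and not x.unsupported_elements:
        reasons.append("low_confidence_without_unsupported_elements")
    if not x.extractor_provenance:
        reasons.append("missing_extractor_provenance")
    if x.permission_context and x.permission_context.get("decision") == "deny":
        reasons.append("denied_permission_context")
    return reasons


def ingest(x: Extraction, registry: List[dict]) -> dict:
    """Validate first; only write to the registry when nothing failed."""
    failures = validation_failures(x)
    if failures:
        return {"status": "quarantined", "reasons": failures}
    # Copy the permission context onto both representations when present.
    meta = {"permission_context": x.permission_context} if x.permission_context else {}
    registry.append({
        "source": {"metadata": dict(meta)},
        "normalized": {"structure": x.normalized, "metadata": dict(meta)},
    })
    return {"status": "succeeded", "reasons": []}
```

Collecting every failed check, rather than stopping at the first, gives the quarantined job a complete list of reasons while still guaranteeing that no registry write happens on failure.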
## Current SQLite Additions
@@ -82,9 +97,9 @@ document classes part of the engine domain model.
- Asynchronous job runner and queue dispatch.
- Deep normalized structure for tables, links, embedded references, and fields
beyond extractor-provided metadata and the CSV/TSV baseline.
beyond extractor-provided metadata and current text/Markdown/CSV baselines.
- Optional deep PDF and office document extraction adapters.
- Quarantine policy checks beyond unsupported/failed extraction paths.
- Enterprise policy adapter integration for ingestion-time policy decisions.
## Test Coverage
@@ -100,8 +115,14 @@ document classes part of the engine domain model.
- unchanged source re-ingestion skip behavior,
- Markitect Markdown adapter delegation and missing-dependency behavior,
- CSV dataset structured normalization,
- normalized structure coverage for text, Markdown, CSV, and document
placeholders,
- PDF and office placeholder ingestion with explicit unsupported-depth
diagnostics,
- ingestion validation/quarantine before registry writes,
- permission-context preservation on trusted ingested assets,
- directory reporting for succeeded, skipped, failed, quarantined, and
retriable items,
- optional Markitect integration contract tests for parser, selector,
operation, snapshot, context package, contract, and schema behavior,
- SQLite reload preserving ingestion jobs and ingested asset state.
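
To make the CSV structured-normalization item above more concrete, here is a small sketch of the kind of output such a test could check. It assumes a hypothetical `normalize_table` helper and invented key names (`schema`, `sample_rows`, `links`, `extractor`). The real extractor's output shape is not shown in this diff, so treat the structure as illustrative only.

```python
import csv
import io
import re

# Simple URL matcher for cell values; an assumption, not the engine's rule.
LINK_RE = re.compile(r"https?://[^\s,;\"'<>]+")


def normalize_table(text: str, delimiter: str = ",", sample_size: int = 2) -> dict:
    """Build a normalized table structure from CSV (or TSV) text."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    header, data = (rows[0], rows[1:]) if rows else ([], [])
    links = [
        {
            "row": r,
            # Fall back to the column index if a row is wider than the header.
            "column": header[c] if c < len(header) else str(c),
            "url": url,
        }
        for r, row in enumerate(data)
        for c, cell in enumerate(row)
        for url in LINK_RE.findall(cell)
    ]
    return {
        "schema": {"columns": header},
        "row_count": len(data),
        "sample_rows": data[:sample_size],
        "links": links,
        "confidence": 1.0 if header else 0.0,
        "extractor": {"name": "csv-sketch", "version": "0"},
    }
```

Passing `delimiter="\t"` covers the TSV case with the same code path.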