richer normalized structure, permission context preservation
@@ -38,13 +38,21 @@ The new `AssetIngestionService` is separate from the older artifact-era
  by default and opt-in content-digest identity for governed file move/rename
  reconciliation.
- Plain text extractor producing a normalized engine representation.
- Plain text structural output includes lines, paragraphs, link extraction,
  confidence, and extractor metadata.
- CSV/TSV dataset extractor producing structured normalized table output with
  columns, row counts, and table metadata.
- CSV/TSV structural output includes table schemas, sample rows, link
  extraction from cell values, confidence, and extractor metadata.
- PDF and office document placeholder extractor that represents binary
  documents as governed assets while reporting metadata-only extraction depth.
- PDF and office placeholder output exposes unsupported elements, unsupported
  counts, confidence, and extractor metadata.
- Markitect markdown extractor adapter boundary that delegates markdown parsing,
  headings, sections, frontmatter, and snapshot identity to `markitect-tool`
  when available.
- Markdown normalized structure preserves Markitect-provided blocks, headings,
  sections, snapshot metadata, table blocks, and link tokens where available.
- Missing `markitect-tool` dependency fails through structured
  `AdapterUnavailableError` diagnostics instead of falling back to local
  Markdown parsing.
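
The missing-dependency behavior in the last bullet of this hunk can be illustrated with a minimal Python sketch. Only the names `AdapterUnavailableError` and `markitect-tool` come from the text; the choice of Python, the importable module name `markitect_tool`, the `load_markdown_backend` helper, and the diagnostic fields are assumptions made for illustration and may not match the actual implementation.

```python
import importlib
import importlib.util


class AdapterUnavailableError(RuntimeError):
    """Raised when an optional adapter dependency cannot be imported."""

    def __init__(self, adapter: str, dependency: str, hint: str = "") -> None:
        super().__init__(f"{adapter} adapter requires '{dependency}', which is not installed")
        self.adapter = adapter
        self.dependency = dependency
        self.hint = hint

    def to_diagnostic(self) -> dict:
        # Hypothetical diagnostic shape; the real field names are not given in the source.
        return {
            "code": "adapter_unavailable",
            "adapter": self.adapter,
            "dependency": self.dependency,
            "hint": self.hint,
        }


def load_markdown_backend(module_name: str = "markitect_tool"):
    """Import the Markdown backend, or fail loudly with a structured error.

    There is deliberately no fallback parser: a missing dependency is reported
    as a diagnostic instead of being papered over with local parsing.
    The default module name is an assumption.
    """
    if importlib.util.find_spec(module_name) is None:
        raise AdapterUnavailableError(
            adapter="markdown",
            dependency=module_name,
            hint="install the optional Markdown dependency to enable this extractor",
        )
    return importlib.import_module(module_name)
```

A caller would catch `AdapterUnavailableError` and attach `to_diagnostic()` to the failed job, so the absence of the dependency stays visible in job reporting.
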
@@ -62,6 +70,13 @@ The new `AssetIngestionService` is separate from the older artifact-era
  source/normalized representations instead of creating a second asset.
- Unchanged source re-ingestion can be skipped without creating a new asset
  version.
- Ingestion validation runs after extraction and before registry writes.
- Invalid normalized output, missing source provenance, checksum mismatch, low
  confidence without explicit unsupported elements, missing extractor
  provenance, or denied source permission context quarantine the job without
  creating a trusted asset.
- Source permission context is preserved on source/normalized representation
  metadata and as a metadata record when present.
- In-memory and SQLite job persistence.

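The validation and quarantine bullets above describe a gate between extraction and registry writes. A rough Python sketch of such a gate follows. Every name in it (`ExtractionResult`, `validate`, `ingest`, the `LOW_CONFIDENCE` threshold, the reason strings, and the `{"decision": "deny"}` shape of a permission context) is hypothetical and chosen for illustration; only the list of quarantine conditions is taken from the text.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

LOW_CONFIDENCE = 0.5  # hypothetical threshold; the source does not state one


@dataclass
class ExtractionResult:
    source_bytes: bytes
    expected_sha256: Optional[str]
    source_provenance: Optional[dict]
    extractor_provenance: Optional[dict]
    normalized: Optional[dict]
    confidence: float
    unsupported_elements: list = field(default_factory=list)
    permission_context: Optional[dict] = None


def validate(result: ExtractionResult) -> List[str]:
    """Return the quarantine reasons for a result; an empty list means valid."""
    reasons = []
    if not isinstance(result.normalized, dict) or not result.normalized:
        reasons.append("invalid_normalized_output")
    if not result.source_provenance:
        reasons.append("missing_source_provenance")
    if not result.extractor_provenance:
        reasons.append("missing_extractor_provenance")
    actual = hashlib.sha256(result.source_bytes).hexdigest()
    if result.expected_sha256 is not None and actual != result.expected_sha256:
        reasons.append("checksum_mismatch")
    # Low confidence is tolerated only when the extractor says what it skipped.
    if result.confidence < LOW_CONFIDENCE and not result.unsupported_elements:
        reasons.append("low_confidence_without_unsupported_elements")
    if (result.permission_context or {}).get("decision") == "deny":
        reasons.append("denied_source_permission_context")
    return reasons


def ingest(result: ExtractionResult, registry: list) -> Tuple[str, List[str]]:
    """Validate after extraction and before any registry write."""
    reasons = validate(result)
    if reasons:
        return "quarantined", reasons  # nothing is written to the registry
    metadata = {}
    if result.permission_context is not None:
        # Carry the permission context along on the stored representation.
        metadata["permission_context"] = result.permission_context
    registry.append({"normalized": result.normalized, "metadata": metadata})
    return "succeeded", []
```

Collecting every reason instead of stopping at the first keeps the quarantine record complete, which is one plausible design; the actual service may report failures differently.
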
## Current SQLite Additions
@@ -82,9 +97,9 @@ document classes part of the engine domain model.

- Asynchronous job runner and queue dispatch.
- Deep normalized structure for tables, links, embedded references, and fields
  beyond extractor-provided metadata and the CSV/TSV baseline.
  beyond extractor-provided metadata and current text/Markdown/CSV baselines.
- Optional deep PDF and office document extraction adapters.
- Quarantine policy checks beyond unsupported/failed extraction paths.
- Enterprise policy adapter integration for ingestion-time policy decisions.

## Test Coverage

@@ -100,8 +115,14 @@ document classes part of the engine domain model.
- unchanged source re-ingestion skip behavior,
- Markitect markdown adapter delegation and missing-dependency behavior,
- CSV dataset structured normalization,
- normalized structure coverage for text, Markdown, CSV, and document
  placeholders,
- PDF and office placeholder ingestion with explicit unsupported-depth
  diagnostics,
- ingestion validation/quarantine before registry writes,
- permission-context preservation on trusted ingested assets,
- directory reporting for succeeded, skipped, failed, quarantined, and
  retriable items,
- optional Markitect integration contract tests for parser, selector,
  operation, snapshot, context package, contract, and schema behavior,
- SQLite reload preserving ingestion jobs and ingested asset state.

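To make the first item in this list concrete, a test for unchanged-source skipping might look roughly like the Python sketch below. It assumes the skip decision compares a SHA-256 digest of the source content against the latest stored version; the `InMemoryIngestion` class, its method names, and the status strings are stand-ins invented for the example, not the actual `AssetIngestionService` API.

```python
import hashlib


class InMemoryIngestion:
    """Toy stand-in that tracks one digest per stored asset version."""

    def __init__(self) -> None:
        self.versions = {}  # source URI -> list of content digests

    def ingest(self, source_uri: str, content: bytes) -> str:
        digest = hashlib.sha256(content).hexdigest()
        history = self.versions.setdefault(source_uri, [])
        if history and history[-1] == digest:
            return "skipped"  # unchanged source: no new asset version
        history.append(digest)
        return "succeeded"


def test_unchanged_source_is_skipped() -> None:
    service = InMemoryIngestion()
    uri = "file:///docs/notes.txt"
    assert service.ingest(uri, b"first draft") == "succeeded"
    assert service.ingest(uri, b"first draft") == "skipped"
    assert len(service.versions[uri]) == 1  # still a single version
    assert service.ingest(uri, b"second draft") == "succeeded"
    assert len(service.versions[uri]) == 2
```
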