default source-location identity and opt-in content-digest identity for file move/rename reconciliation, PDF/DOCX-style placeholder ingestion

2026-05-06 13:04:36 +02:00
parent 48dffedc09
commit a4a4759ac4
13 changed files with 724 additions and 39 deletions


@@ -34,10 +34,20 @@ The new `AssetIngestionService` is separate from the older artifact-era
- Connector and extractor port contracts owned by the engine.
- Local file connector with source references, checksums, media type detection,
file metadata, and directory file iteration.
- Explicit ingestion identity policy with conservative source-location identity
by default and opt-in content-digest identity for governed file move/rename
reconciliation.
- Plain text extractor producing a normalized engine representation.
- CSV/TSV dataset extractor producing structured normalized table output with
columns, row counts, and table metadata.
- PDF and office document placeholder extractor that represents binary
documents as governed assets while reporting metadata-only extraction depth.
- Markitect markdown extractor adapter boundary that delegates markdown parsing,
headings, sections, frontmatter, and snapshot identity to `markitect-tool`
when available.
- Missing `markitect-tool` dependency fails through structured
`AdapterUnavailableError` diagnostics instead of falling back to local
Markdown parsing.
- Synchronous first-run ingestion flow that creates governed assets through
`AssetRegistryService`.
- Source and normalized `AssetRepresentation` records for ingested files.
@@ -46,6 +56,12 @@ The new `AssetIngestionService` is separate from the older artifact-era
- Failed unsupported-media ingestion records job failure without adding an asset
to the trusted registry.
- Directory ingestion with per-file child jobs and partial result accounting.
- Directory item results distinguish succeeded, skipped, failed, quarantined,
and retriable failure state.
- Re-ingestion can update an existing asset with new source references and
source/normalized representations instead of creating a second asset.
- Unchanged source re-ingestion can be skipped without creating a new asset
version.
- In-memory and SQLite job persistence.
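
The identity policy and unchanged-source skip described in the list above can be illustrated with a minimal sketch. It is not the engine's implementation: `IdentityPolicy`, `identity_key`, `plan_ingestion`, and the dictionary-based registry are hypothetical names chosen for illustration, and SHA-256 is an assumed digest.

```python
import hashlib
from enum import Enum
from pathlib import Path


class IdentityPolicy(Enum):
    """Hypothetical policy names; the engine's real identifiers may differ."""
    SOURCE_LOCATION = "source-location"
    CONTENT_DIGEST = "content-digest"


def identity_key(path: Path, content: bytes, policy: IdentityPolicy) -> str:
    """Return the key used to match an incoming file to an existing asset."""
    if policy is IdentityPolicy.CONTENT_DIGEST:
        # Same bytes give the same key, so a moved or renamed file still matches.
        return "digest:" + hashlib.sha256(content).hexdigest()
    # Conservative default: identity follows the location, so a move yields a new key.
    return "location:" + path.as_posix()


def plan_ingestion(registry: dict, path: Path, content: bytes,
                   policy: IdentityPolicy) -> str:
    """Decide create, update, or skip; `registry` maps key -> (path, checksum)."""
    key = identity_key(path, content, policy)
    current = (path.as_posix(), hashlib.sha256(content).hexdigest())
    known = registry.get(key)
    if known is None:
        registry[key] = current
        return "created"
    if known == current:
        # Unchanged source: no new asset version is recorded.
        return "skipped"
    # Same asset, new source reference or new content.
    registry[key] = current
    return "updated"
```

Under the content-digest policy in this sketch, ingesting the same bytes from a new path returns `"updated"` against the existing entry. Under the source-location default, the same move returns `"created"` and leaves two entries.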
## Current SQLite Additions
@@ -65,11 +81,9 @@ document classes part of the engine domain model.
## Not Yet Implemented
- Asynchronous job runner and queue dispatch.
- Re-ingestion reconciliation for existing assets.
- Identity policies that preserve asset identity across source moves.
- PDF, office document, and dataset extractors.
- Deep normalized structure for tables, links, embedded references, and fields
beyond extractor-provided metadata.
beyond extractor-provided metadata and the CSV/TSV baseline.
- Optional deep PDF and office document extraction adapters.
- Quarantine policy checks beyond unsupported/failed extraction paths.
## Test Coverage
@@ -81,4 +95,13 @@ document classes part of the engine domain model.
- job persistence and status inspection,
- unsupported media failure without trusted asset creation,
- directory partial success/failure accounting,
- directory skipped item and retriable failure reporting,
- content-digest identity preserving asset identity across file moves,
- unchanged source re-ingestion skip behavior,
- Markitect markdown adapter delegation and missing-dependency behavior,
- CSV dataset structured normalization,
- PDF and office placeholder ingestion with explicit unsupported-depth
diagnostics,
- optional Markitect integration contract tests for parser, selector,
operation, snapshot, context package, contract, and schema behavior,
- SQLite reload preserving ingestion jobs and ingested asset state.
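
The missing-dependency behavior covered by the tests above (failing with `AdapterUnavailableError` rather than falling back to local Markdown parsing) might look roughly like the following sketch. The error class here is a stand-in with assumed fields, and the module name `markitect_tool`, the adapter label, and `load_markdown_backend` are hypothetical; the real import path and error shape are not shown in this document.

```python
import importlib


class AdapterUnavailableError(RuntimeError):
    """Stand-in for the engine's error type; the fields here are assumptions."""

    def __init__(self, adapter: str, dependency: str):
        super().__init__(
            f"adapter '{adapter}' unavailable: missing dependency '{dependency}'"
        )
        self.adapter = adapter
        self.dependency = dependency


def load_markdown_backend(module_name: str = "markitect_tool"):
    """Import the backend module or fail with a structured error.

    There is deliberately no fallback parser: a missing dependency is
    reported to the caller instead of being papered over.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise AdapterUnavailableError("markitect-markdown", module_name) from exc
```

Keeping the failure structured lets an ingestion job record which adapter and dependency were missing, rather than silently producing a differently parsed representation.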