default source-location identity and opt-in content-digest identity for file move/rename reconciliation, PDF/DOCX-style placeholder ingestion

2026-05-06 13:04:36 +02:00
parent 48dffedc09
commit a4a4759ac4
13 changed files with 724 additions and 39 deletions
--- a/docs/ingestion-implementation.md
+++ b/docs/ingestion-implementation.md
@@ -34,10 +34,20 @@ The new `AssetIngestionService` is separate from the older artifact-era
 - Connector and extractor port contracts owned by the engine.
 - Local file connector with source references, checksums, media type detection,
  file metadata, and directory file iteration.
+- Explicit ingestion identity policy with conservative source-location identity
+  by default and opt-in content-digest identity for governed file move/rename
+  reconciliation.
 - Plain text extractor producing a normalized engine representation.
+- CSV/TSV dataset extractor producing structured normalized table output with
+  columns, row counts, and table metadata.
+- PDF and office document placeholder extractor that represents binary
+  documents as governed assets while reporting metadata-only extraction depth.
 - Markitect markdown extractor adapter boundary that delegates markdown parsing,
  headings, sections, frontmatter, and snapshot identity to `markitect-tool`
  when available.
+- A missing `markitect-tool` dependency fails with structured
+  `AdapterUnavailableError` diagnostics instead of falling back to local
+  Markdown parsing.
 - Synchronous first-run ingestion flow that creates governed assets through
  `AssetRegistryService`.
 - Source and normalized `AssetRepresentation` records for ingested files.
@@ -46,6 +56,12 @@ The new `AssetIngestionService` is separate from the older artifact-era
 - Failed unsupported-media ingestion records job failure without adding an asset
  to the trusted registry.
 - Directory ingestion with per-file child jobs and partial result accounting.
+- Directory item results distinguish succeeded, skipped, failed, quarantined,
+  and retriable-failure states.
+- Re-ingestion can update an existing asset with new source references and
+  source/normalized representations instead of creating a second asset.
+- Unchanged source re-ingestion can be skipped without creating a new asset
+  version.
 - In-memory and SQLite job persistence.

 ## Current SQLite Additions
@@ -65,11 +81,9 @@ document classes part of the engine domain model.
 ## Not Yet Implemented

 - Asynchronous job runner and queue dispatch.
-- Re-ingestion reconciliation for existing assets.
-- Identity policies that preserve asset identity across source moves.
-- PDF, office document, and dataset extractors.
 - Deep normalized structure for tables, links, embedded references, and fields
-  beyond extractor-provided metadata.
+  beyond extractor-provided metadata and the CSV/TSV baseline.
+- Optional deep PDF and office document extraction adapters.
 - Quarantine policy checks beyond unsupported/failed extraction paths.

 ## Test Coverage
@@ -81,4 +95,13 @@ document classes part of the engine domain model.
 - job persistence and status inspection,
 - unsupported media failure without trusted asset creation,
 - directory partial success/failure accounting,
+- directory skipped item and retriable failure reporting,
+- content-digest identity preserving asset identity across file moves,
+- unchanged source re-ingestion skip behavior,
+- Markitect markdown adapter delegation and missing-dependency behavior,
+- CSV dataset structured normalization,
+- PDF and office placeholder ingestion with explicit unsupported-depth
+  diagnostics,
+- optional Markitect integration contract tests for parser, selector,
+  operation, snapshot, context package, contract, and schema behavior,
 - SQLite reload preserving ingestion jobs and ingested asset state.
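
Reviewer note: the identity policy described in this change (source-location identity by default, content-digest identity as an opt-in for move/rename reconciliation) can be illustrated with a minimal sketch. Everything named here (`IdentityPolicy`, `identity_key`, `ingest`, the in-memory `registry` dict, the SHA-256 choice, and the key formats) is a hypothetical stand-in for illustration; none of it is taken from `AssetIngestionService` or `AssetRegistryService`.

```python
import hashlib
from enum import Enum
from pathlib import Path


class IdentityPolicy(Enum):
    """Hypothetical policy names; the engine's real names may differ."""
    SOURCE_LOCATION = "source_location"  # conservative default
    CONTENT_DIGEST = "content_digest"    # opt-in, survives move/rename


def identity_key(path: Path,
                 policy: IdentityPolicy = IdentityPolicy.SOURCE_LOCATION) -> str:
    """Derive the key used to match a file against already-known assets."""
    if policy is IdentityPolicy.CONTENT_DIGEST:
        # Assumption: SHA-256 over the raw bytes stands in for the checksum.
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        return f"digest:sha256:{digest}"
    # Default: identity follows the resolved source location.
    return f"location:{path.resolve().as_posix()}"


def ingest(path: Path, registry: dict,
           policy: IdentityPolicy = IdentityPolicy.SOURCE_LOCATION):
    """Return (asset_id, status); status is 'created' or 'existing'."""
    key = identity_key(path, policy)
    if key in registry:
        # Known identity: reuse the asset instead of creating a second one.
        return registry[key], "existing"
    asset_id = f"asset-{len(registry) + 1}"
    registry[key] = asset_id
    return asset_id, "created"
```

Under this sketch, ingesting a file, renaming it, and ingesting it again yields the same asset id with `CONTENT_DIGEST`, while `SOURCE_LOCATION` treats the renamed file as a new asset. That is the trade-off the "conservative by default" wording points at: location identity never merges two distinct files that happen to share bytes, at the cost of losing identity on a move.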