default source-location identity and opt-in content-digest identity for file move/rename reconciliation, PDF/DOCX-style placeholder ingestion

2026-05-06 13:04:36 +02:00
parent 48dffedc09
commit a4a4759ac4
13 changed files with 724 additions and 39 deletions
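
The two identity policies named in the commit title can be illustrated with a minimal sketch. It is not taken from the changed files. Only the enum name `IngestionIdentityPolicy` and its members `SOURCE_LOCATION` and `CONTENT_DIGEST` appear in the workplan. The enum string values, the `asset_identity` helper, the key prefixes, and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
from enum import Enum


class IngestionIdentityPolicy(Enum):
    # Member names come from the workplan; the string values are assumed.
    SOURCE_LOCATION = "source_location"
    CONTENT_DIGEST = "content_digest"


def asset_identity(
    path: str,
    content: bytes,
    policy: IngestionIdentityPolicy = IngestionIdentityPolicy.SOURCE_LOCATION,
) -> str:
    """Hypothetical helper: derive an identity key for an ingested file."""
    if policy is IngestionIdentityPolicy.CONTENT_DIGEST:
        # Identity follows the bytes, so a move or rename keeps the same key.
        return "digest:" + hashlib.sha256(content).hexdigest()
    # Conservative default: identity follows the path, so a move yields a new key.
    return "location:" + path
```

Under these assumptions, a file moved from `a/x.md` to `b/y.md` keeps its key only when the caller opts into `CONTENT_DIGEST`. With the default policy the new path produces a new key.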
--- a/workplans/KONT-WP-0006-multi-format-ingestion-normalization.md
+++ b/workplans/KONT-WP-0006-multi-format-ingestion-normalization.md
@@ -51,9 +51,14 @@ As of 2026-05-06, the first ingestion slice is recorded in
 `docs/ingestion-implementation.md`. It establishes ingestion job primitives,
 connector/extractor ports, local file ingestion, plain text normalization,
 Markitect markdown adapter boundaries, directory partial-result reporting, and
-in-memory/SQLite job persistence. Remaining work is focused on async execution,
-re-ingestion identity reconciliation, richer structural extraction, quarantine
-policy checks, and non-text format adapters.
+in-memory/SQLite job persistence. It now also includes explicit ingestion
+identity policy, content-digest identity for governed file move/rename
+reconciliation, unchanged-source skip behavior, and directory item retry/skipped
+reporting. CSV/TSV datasets now produce structured normalized table output, and
+PDF/office-like files can enter the governed asset set through metadata-only
+placeholder extraction with explicit unsupported-depth diagnostics. Remaining
+work is focused on async execution, richer structural extraction, quarantine
+policy checks, and optional deep non-text extraction adapters.

 ## I6.1 - Implement ingestion job model status and retry surface

@@ -97,7 +102,7 @@ Acceptance:

 ```task
 id: KONT-WP-0006-T003
-status: in_progress
+status: done
 priority: high
 state_hub_task_id: "d3e3d4d2-a581-4438-bee7-6fc4161d3925"
 ```
@@ -111,11 +116,21 @@ Acceptance:
 - File path changes can be represented without changing stable asset identity
  when identity policy permits.

+Implemented:
+
+- `IngestionIdentityPolicy.SOURCE_LOCATION` remains the conservative default.
+- `IngestionIdentityPolicy.CONTENT_DIGEST` preserves asset identity across file
+  moves or renames when the caller opts into content identity.
+- Existing assets receive a versioned `asset.ingest.update` record with new
+  source references and representations.
+- Re-ingesting an unchanged source is reported as a skipped child item without
+  creating another asset version.
+
 ## I6.4 - Implement text and markdown normalization via markitect-tool adapter

 ```task
 id: KONT-WP-0006-T004
-status: in_progress
+status: done
 priority: high
 state_hub_task_id: "63bf2f7e-705d-40ae-a160-75fc508ffb1f"
 ```
@@ -131,11 +146,23 @@ Acceptance:
 - Parser, selector extraction, and snapshot identity behavior are covered by
  the Markitect integration contract tests.

+Implemented:
+
+- Plain text normalization produces source-grounded normalized representations.
+- Markdown normalization imports and calls `markitect-tool` only inside the
+  adapter boundary.
+- Missing `markitect-tool` raises structured `AdapterUnavailableError`
+  diagnostics.
+- Adapter unit tests verify delegation and missing-dependency behavior.
+- Optional contract tests verify parser, selector extraction, operations,
+  snapshot identity, context packages, contracts, and schema behavior against
+  the local `markitect-tool` checkout when available.
+
 ## I6.5 - Implement PDF office document and dataset baseline adapters

 ```task
 id: KONT-WP-0006-T005
-status: todo
+status: done
 priority: high
 state_hub_task_id: "04d7c4b0-abfd-4b14-892f-91d1c1a820cd"
 ```
@@ -150,6 +177,15 @@ Acceptance:
 - Unsupported extraction depth is reported explicitly.
 - CSV or table-like datasets produce structured normalized output.

+Implemented:
+
+- `CsvDatasetExtractor` supports CSV and TSV sources with structured columns,
+  row counts, table metadata, and normalized dataset fields.
+- `DocumentPlaceholderExtractor` supports PDF and common office media types as
+  metadata-only assets with `extraction.depth_unsupported` diagnostics.
+- Local file media-type detection is explicit for CSV, TSV, PDF, DOC/DOCX,
+  XLS/XLSX, and PPT/PPTX.
+
 ## I6.6 - Extract structural elements into common normalized representation

 ```task