default source-location identity and opt-in content-digest identity for file move/rename reconciliation, PDF/DOCX-style placeholder ingestion

2026-05-06 13:04:36 +02:00
parent 48dffedc09
commit a4a4759ac4
13 changed files with 724 additions and 39 deletions


@@ -51,9 +51,14 @@ As of 2026-05-06, the first ingestion slice is recorded in
`docs/ingestion-implementation.md`. It establishes ingestion job primitives,
connector/extractor ports, local file ingestion, plain text normalization,
Markitect markdown adapter boundaries, directory partial-result reporting, and
-in-memory/SQLite job persistence. Remaining work is focused on async execution,
-re-ingestion identity reconciliation, richer structural extraction, quarantine
-policy checks, and non-text format adapters.
+in-memory/SQLite job persistence. It now also includes explicit ingestion
+identity policy, content-digest identity for governed file move/rename
+reconciliation, unchanged-source skip behavior, and directory item retry/skipped
+reporting. CSV/TSV datasets now produce structured normalized table output, and
+PDF/office-like files can enter the governed asset set through metadata-only
+placeholder extraction with explicit unsupported-depth diagnostics. Remaining
+work is focused on async execution, richer structural extraction, quarantine
+policy checks, and optional deep non-text extraction adapters.
## I6.1 - Implement ingestion job model status and retry surface
@@ -97,7 +102,7 @@ Acceptance:
```task
id: KONT-WP-0006-T003
-status: in_progress
+status: done
priority: high
state_hub_task_id: "d3e3d4d2-a581-4438-bee7-6fc4161d3925"
```
@@ -111,11 +116,21 @@ Acceptance:
- File path changes can be represented without changing stable asset identity
when identity policy permits.
Implemented:
- `IngestionIdentityPolicy.SOURCE_LOCATION` remains the conservative default.
- `IngestionIdentityPolicy.CONTENT_DIGEST` preserves asset identity across file
moves or renames when the caller opts into content identity.
- Existing assets receive a versioned `asset.ingest.update` record with new
source references and representations.
- Re-ingesting an unchanged source is reported as a skipped child item without
creating another asset version.
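The identity behavior listed above can be sketched as follows. This is a minimal illustration, not the committed implementation: only the `IngestionIdentityPolicy` member names come from the text, while the enum values, the `identity_key` and `reconcile` helpers, and the dict-based asset store are hypothetical. It also simplifies by treating changed content under `CONTENT_DIGEST` as a new asset.

```python
import hashlib
from enum import Enum


class IngestionIdentityPolicy(Enum):
    # Member names follow the text above; the string values are assumed.
    SOURCE_LOCATION = "source_location"
    CONTENT_DIGEST = "content_digest"


def identity_key(policy, source_uri, content):
    """Hypothetical helper: derive the key used to match existing assets."""
    if policy is IngestionIdentityPolicy.CONTENT_DIGEST:
        return "sha256:" + hashlib.sha256(content).hexdigest()
    return "location:" + source_uri


def reconcile(assets, policy, source_uri, content):
    """Hypothetical helper: classify an ingestion as created, updated, or skipped."""
    key = identity_key(policy, source_uri, content)
    digest = hashlib.sha256(content).hexdigest()
    existing = assets.get(key)
    if existing is None:
        assets[key] = {"source_uri": source_uri, "digest": digest, "version": 1}
        return "created"
    if existing["source_uri"] == source_uri and existing["digest"] == digest:
        # Unchanged source: report a skip and do not create a new version.
        return "skipped"
    # Same identity, new location or content: record a new version.
    existing.update(source_uri=source_uri, digest=digest)
    existing["version"] += 1
    return "updated"
```

Under these assumptions, moving `a/report.txt` to `b/report.txt` with identical bytes yields `updated` on the same asset when the caller opts into `CONTENT_DIGEST`, and a second, separate asset under the default `SOURCE_LOCATION`.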
## I6.4 - Implement text and markdown normalization via markitect-tool adapter
```task
id: KONT-WP-0006-T004
-status: in_progress
+status: done
priority: high
state_hub_task_id: "63bf2f7e-705d-40ae-a160-75fc508ffb1f"
```
@@ -131,11 +146,23 @@ Acceptance:
- Parser, selector extraction, and snapshot identity behavior are covered by
the Markitect integration contract tests.
Implemented:
- Plain text normalization produces source-grounded normalized representations.
- Markdown normalization imports and calls `markitect-tool` only inside the
adapter boundary.
- Missing `markitect-tool` raises structured `AdapterUnavailableError`
diagnostics.
- Adapter unit tests verify delegation and missing-dependency behavior.
- Optional contract tests verify parser, selector extraction, operations,
snapshot identity, context packages, contracts, and schema behavior against
the local `markitect-tool` checkout when available.
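One common way to keep an optional dependency inside an adapter boundary is a deferred import that converts `ImportError` into a structured error. The sketch below assumes that approach; the `AdapterUnavailableError` name is taken from the text, but its constructor, the `diagnostic` fields, and the `load_adapter_dependency` helper are hypothetical, and the importable module name for `markitect-tool` is left as a parameter because the source does not state it.

```python
import importlib


class AdapterUnavailableError(RuntimeError):
    """Name from the text above; the fields below are assumptions."""

    def __init__(self, adapter, dependency):
        super().__init__(f"{adapter} adapter unavailable: missing {dependency}")
        self.diagnostic = {
            "code": "adapter.unavailable",  # assumed diagnostic code
            "adapter": adapter,
            "dependency": dependency,
        }


def load_adapter_dependency(adapter, module_name):
    """Hypothetical helper: import an optional module only when the adapter runs."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Surface a structured error instead of a bare import failure.
        raise AdapterUnavailableError(adapter, module_name) from exc
```

Because the import happens inside the helper rather than at module load time, code outside the adapter never needs the optional package to be installed.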
## I6.5 - Implement PDF office document and dataset baseline adapters
```task
id: KONT-WP-0006-T005
-status: todo
+status: done
priority: high
state_hub_task_id: "04d7c4b0-abfd-4b14-892f-91d1c1a820cd"
```
@@ -150,6 +177,15 @@ Acceptance:
- Unsupported extraction depth is reported explicitly.
- CSV or table-like datasets produce structured normalized output.
Implemented:
- `CsvDatasetExtractor` supports CSV and TSV sources with structured columns,
row counts, table metadata, and normalized dataset fields.
- `DocumentPlaceholderExtractor` supports PDF and common office media types as
metadata-only assets with `extraction.depth_unsupported` diagnostics.
- Local file media-type detection is explicit for CSV, TSV, PDF, DOC/DOCX,
XLS/XLSX, and PPT/PPTX.
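The split between structured dataset extraction and metadata-only placeholders might look roughly like the sketch below. The extension-to-media-type table uses the standard registered media types, and the `extraction.depth_unsupported` code comes from the text; the `detect_media_type` and `extract` functions and the shape of the returned dict are hypothetical and do not reflect the actual `CsvDatasetExtractor` or `DocumentPlaceholderExtractor` interfaces.

```python
import csv
import io
from pathlib import PurePath

# Explicit extension table; no content sniffing is attempted.
MEDIA_TYPES = {
    ".csv": "text/csv",
    ".tsv": "text/tab-separated-values",
    ".pdf": "application/pdf",
    ".doc": "application/msword",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".xls": "application/vnd.ms-excel",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".ppt": "application/vnd.ms-powerpoint",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}
TABULAR = {"text/csv": ",", "text/tab-separated-values": "\t"}


def detect_media_type(path):
    """Hypothetical helper: map a file extension to a media type, or None."""
    return MEDIA_TYPES.get(PurePath(path).suffix.lower())


def extract(path, text=""):
    """Hypothetical helper: structured output for tables, placeholder otherwise."""
    media_type = detect_media_type(path)
    if media_type is None:
        raise ValueError(f"unsupported source: {path}")
    if media_type in TABULAR:
        rows = list(csv.reader(io.StringIO(text), delimiter=TABULAR[media_type]))
        columns, body = (rows[0], rows[1:]) if rows else ([], [])
        return {"media_type": media_type, "columns": columns,
                "row_count": len(body), "diagnostics": []}
    # PDF and office formats: metadata only, with an explicit diagnostic.
    return {"media_type": media_type, "columns": None, "row_count": None,
            "diagnostics": ["extraction.depth_unsupported"]}
```

Keeping the placeholder path explicit means a PDF still becomes a governed asset, while the diagnostic makes clear that its body was not extracted.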
## I6.6 - Extract structural elements into common normalized representation
```task