generated from coulomb/repo-seed
default source-location identity and opt-in content-digest identity for file move/rename reconciliation, PDF/DOCX-style placeholder ingestion
@@ -34,10 +34,20 @@ The new `AssetIngestionService` is separate from the older artifact-era
- Connector and extractor port contracts owned by the engine.
- Local file connector with source references, checksums, media type detection,
  file metadata, and directory file iteration.
- Explicit ingestion identity policy with conservative source-location identity
  by default and opt-in content-digest identity for governed file move/rename
  reconciliation.
- Plain text extractor producing a normalized engine representation.
- CSV/TSV dataset extractor producing structured normalized table output with
  columns, row counts, and table metadata.
- PDF and office document placeholder extractor that represents binary
  documents as governed assets while reporting metadata-only extraction depth.
- Markitect markdown extractor adapter boundary that delegates markdown parsing,
  headings, sections, frontmatter, and snapshot identity to `markitect-tool`
  when available.
- Missing `markitect-tool` dependency fails through structured
  `AdapterUnavailableError` diagnostics instead of falling back to local
  Markdown parsing.
- Synchronous first-run ingestion flow that creates governed assets through
  `AssetRegistryService`.
- Source and normalized `AssetRepresentation` records for ingested files.
@@ -46,6 +56,12 @@ The new `AssetIngestionService` is separate from the older artifact-era
- Failed unsupported-media ingestion records job failure without adding an asset
  to the trusted registry.
- Directory ingestion with per-file child jobs and partial result accounting.
- Directory item results distinguish succeeded, skipped, failed, quarantined,
  and retriable failure state.
- Re-ingestion can update an existing asset with new source references and
  source/normalized representations instead of creating a second asset.
- Unchanged source re-ingestion can be skipped without creating a new asset
  version.
- In-memory and SQLite job persistence.
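
The two identity policies named in the list above can be illustrated with a minimal sketch. This is not the engine's code: `IdentityPolicy`, `identity_key`, and `keys_before_and_after_move` are hypothetical names, and the use of SHA-256 and Python is an assumption made only for illustration.

```python
import hashlib
import tempfile
from enum import Enum
from pathlib import Path


class IdentityPolicy(Enum):
    """Hypothetical policy names; the real engine may model this differently."""

    SOURCE_LOCATION = "source-location"  # conservative default per the list above
    CONTENT_DIGEST = "content-digest"    # opt-in, survives moves and renames


def identity_key(
    path: Path, policy: IdentityPolicy = IdentityPolicy.SOURCE_LOCATION
) -> str:
    """Derive the key used to decide whether a file is an already-known asset."""
    if policy is IdentityPolicy.CONTENT_DIGEST:
        # Assumption: SHA-256 over the raw bytes stands in for the checksum.
        return "sha256:" + hashlib.sha256(path.read_bytes()).hexdigest()
    # Default: identity is tied to where the file lives.
    return "location:" + path.resolve().as_posix()


def keys_before_and_after_move(policy: IdentityPolicy) -> tuple[str, str]:
    """Write a file, rename it, and return its identity key before and after."""
    with tempfile.TemporaryDirectory() as tmp:
        old = Path(tmp) / "report.txt"
        old.write_text("quarterly numbers\n", encoding="utf-8")
        before = identity_key(old, policy)
        new = Path(tmp) / "renamed.txt"
        old.rename(new)
        after = identity_key(new, policy)
        return before, after
```

Under the content-digest policy the two keys match, so a rename could be reconciled to the same asset. Under the source-location default they differ, which is the conservative outcome: a moved file is never silently merged into an existing asset.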

## Current SQLite Additions
@@ -65,11 +81,9 @@ document classes part of the engine domain model.
## Not Yet Implemented

- Asynchronous job runner and queue dispatch.
- Re-ingestion reconciliation for existing assets.
- Identity policies that preserve asset identity across source moves.
- PDF, office document, and dataset extractors.
- Deep normalized structure for tables, links, embedded references, and fields
  beyond extractor-provided metadata.
  beyond extractor-provided metadata and the CSV/TSV baseline.
- Optional deep PDF and office document extraction adapters.
- Quarantine policy checks beyond unsupported/failed extraction paths.

## Test Coverage
@@ -81,4 +95,13 @@ document classes part of the engine domain model.
- job persistence and status inspection,
- unsupported media failure without trusted asset creation,
- directory partial success/failure accounting,
- directory skipped item and retriable failure reporting,
- content-digest identity preserving asset identity across file moves,
- unchanged source re-ingestion skip behavior,
- Markitect markdown adapter delegation and missing-dependency behavior,
- CSV dataset structured normalization,
- PDF and office placeholder ingestion with explicit unsupported-depth
  diagnostics,
- optional Markitect integration contract tests for parser, selector,
  operation, snapshot, context package, contract, and schema behavior,
- SQLite reload preserving ingestion jobs and ingested asset state.
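
The directory accounting covered by these tests might look roughly like the sketch below. All names here (`ItemStatus`, `ItemResult`, `summarize`) are hypothetical, and modelling "retriable" as a flag on failed items is an assumption; the engine may represent it as a separate state.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class ItemStatus(Enum):
    SUCCEEDED = "succeeded"
    SKIPPED = "skipped"
    FAILED = "failed"
    QUARANTINED = "quarantined"


@dataclass(frozen=True)
class ItemResult:
    """Outcome of one per-file child job (hypothetical shape)."""

    path: str
    status: ItemStatus
    retriable: bool = False  # assumption: only meaningful when status is FAILED


def summarize(results: list[ItemResult]) -> dict[str, object]:
    """Tally per-file outcomes into a directory-level summary."""
    counts = Counter(r.status for r in results)
    summary: dict[str, object] = {s.value: counts.get(s, 0) for s in ItemStatus}
    summary["retriable_failures"] = sum(
        1 for r in results if r.status is ItemStatus.FAILED and r.retriable
    )
    # "Partial" here means at least one file succeeded while at least one
    # other file failed or was quarantined.
    bad = counts.get(ItemStatus.FAILED, 0) + counts.get(ItemStatus.QUARANTINED, 0)
    summary["partial"] = counts.get(ItemStatus.SUCCEEDED, 0) > 0 and bad > 0
    return summary
```

A test in this style would build a mixed list of results and assert on each count, which keeps skipped and retriable items visible instead of folding them into a single pass/fail flag.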