generated from coulomb/repo-seed
first ingestion/normalization slice
84
docs/ingestion-implementation.md
Normal file
@@ -0,0 +1,84 @@
# Ingestion Implementation Note

Date: 2026-05-06

Status: first implementation slice for `KONT-WP-0006`.

## Purpose

This note records the first governed ingestion implementation built on top of
the asset registry service. It turns local source files into stable knowledge
assets with source and normalized representations while preserving source
provenance, actor context, auditability, and inspectable ingestion job state.

## Implemented Package Shape

```text
src/kontextual_engine/
  core/ingestion.py
  ports/ingestion.py
  services/ingestion_service.py
  adapters/local_files/
  adapters/builtin_extractors/
  adapters/markitect_tool/
```

The new `AssetIngestionService` is separate from the older artifact-era
`IngestionService` compatibility facade in `src/kontextual_engine/ingestion.py`.

## Implemented Capabilities

- Durable `IngestionJob` state with queued, running, completed, failed,
  partially completed, retried, quarantined, and canceled statuses.
- Structured ingestion failures with retriable flags and diagnostic details.
- Connector and extractor port contracts owned by the engine.
- Local file connector with source references, checksums, media type detection,
  file metadata, and directory file iteration.
- Plain text extractor producing a normalized engine representation.
- Markitect markdown extractor adapter boundary that delegates markdown parsing,
  headings, sections, frontmatter, and snapshot identity to `markitect-tool`
  when available.
- Synchronous first-run ingestion flow that creates governed assets through
  `AssetRegistryService`.
- Source and normalized `AssetRepresentation` records for ingested files.
- Metadata records for connector, extractor, source digest, source media type,
  size, and extraction metadata.
- Failed unsupported-media ingestion records job failure without adding an asset
  to the trusted registry.
- Directory ingestion with per-file child jobs and partial result accounting.
- In-memory and SQLite job persistence.

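The directory capability mentions per-file child jobs with partial result
accounting. The following is a minimal sketch of how a parent status might be
derived from child outcomes. The rule itself (no failures gives completed, no
successes gives failed, anything else gives partially completed) is an
assumption, not something this note confirms, and `ChildOutcome` and
`summarize_children` are hypothetical names. Only the status values come from
the list above.

```python
from dataclasses import dataclass
from enum import Enum


class IngestionJobStatus(str, Enum):
    # Subset of the statuses listed in this note; values mirror the note.
    COMPLETED = "completed"
    FAILED = "failed"
    PARTIALLY_COMPLETED = "partially_completed"


@dataclass(frozen=True)
class ChildOutcome:
    # Hypothetical per-file result record, for illustration only.
    source_uri: str
    succeeded: bool


def summarize_children(children: list[ChildOutcome]) -> tuple[IngestionJobStatus, dict[str, int]]:
    """Derive a parent status and counts from per-file outcomes (assumed rule)."""
    succeeded = sum(1 for child in children if child.succeeded)
    failed = len(children) - succeeded
    counts = {"total": len(children), "succeeded": succeeded, "failed": failed}
    if failed == 0:
        # Note: an empty directory also lands here under this assumed rule.
        return IngestionJobStatus.COMPLETED, counts
    if succeeded == 0:
        return IngestionJobStatus.FAILED, counts
    return IngestionJobStatus.PARTIALLY_COMPLETED, counts


status, counts = summarize_children(
    [ChildOutcome("notes.txt", True), ChildOutcome("image.bin", False)]
)
```

Whether an empty directory should count as completed is a policy question the
sketch does not settle.
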
## Current SQLite Additions

- `ingestion_jobs`

The table stores status, actor, correlation ID, and timestamps as columns,
with indexes on status and correlation ID, plus a JSON payload for the full
job contract.

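A minimal sketch of the table and the upsert-by-id write, using only the
standard-library `sqlite3` module against an in-memory database. The column
and index names follow the `ingestion_jobs` DDL in this commit. The `save_job`
helper and the sample job dictionary are illustrative assumptions, not the
repository's API. The `on conflict ... do update` clause needs SQLite 3.24 or
newer.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table if not exists ingestion_jobs (
        id text primary key,
        status text not null,
        actor_id text not null,
        correlation_id text not null,
        created_at text not null,
        updated_at text not null,
        payload text not null
    );
    create index if not exists idx_ingestion_jobs_status on ingestion_jobs(status);
    create index if not exists idx_ingestion_jobs_correlation on ingestion_jobs(correlation_id);
    """
)


def save_job(job: dict) -> None:
    # Hypothetical helper: insert a job row, or update it in place when the id exists.
    conn.execute(
        """
        insert into ingestion_jobs (id, status, actor_id, correlation_id, created_at, updated_at, payload)
        values (?, ?, ?, ?, ?, ?, ?)
        on conflict(id) do update set
            status=excluded.status,
            actor_id=excluded.actor_id,
            correlation_id=excluded.correlation_id,
            updated_at=excluded.updated_at,
            payload=excluded.payload
        """,
        (
            job["job_id"], job["status"], job["actor_id"], job["correlation_id"],
            job["created_at"], job["updated_at"], json.dumps(job, sort_keys=True),
        ),
    )
    conn.commit()


job = {
    "job_id": "job-1", "status": "running", "actor_id": "actor-1", "correlation_id": "corr-1",
    "created_at": "2026-05-06T00:00:00Z", "updated_at": "2026-05-06T00:00:00Z",
}
save_job(job)
# Saving the same id again updates the row; created_at is not in the update set.
save_job({**job, "status": "completed", "updated_at": "2026-05-06T00:00:05Z"})
rows = conn.execute("select id, status, created_at from ingestion_jobs").fetchall()
```

Because `created_at` is left out of the update set, a re-save keeps the
original creation timestamp while the status and payload move forward.
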
## markitect-tool Boundary

Markdown ingestion uses `MarkitectMarkdownExtractor`, which imports
`markitect-tool` only inside the adapter. The engine preserves Markitect output
as normalized structure and adapter metadata; it does not make Markitect
document classes part of the engine domain model.

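The boundary can be pictured as two small steps: import the optional
dependency lazily, then copy only plain data out of whatever it returns. The
sketch below is a simplified illustration under stated assumptions.
`load_optional`, `to_plain_structure`, and `_FakeDocument` are hypothetical
names, and `AdapterUnavailableError` is redefined locally so the block runs on
its own; it is not the engine's actual class.

```python
import importlib
from typing import Any


class AdapterUnavailableError(RuntimeError):
    """Stand-in for the engine's error type, defined here so the sketch runs alone."""


def load_optional(module_name: str) -> Any:
    # Import lazily so that importing the engine never requires the dependency.
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise AdapterUnavailableError(f"{module_name} is required for this adapter") from exc


def to_plain_structure(document: Any) -> dict[str, Any]:
    # Copy only plain dicts and lists across the boundary; the third-party
    # document object itself is never stored in the engine's model.
    serialized = document.to_dict() if hasattr(document, "to_dict") else {}
    return {
        "frontmatter": dict(serialized.get("frontmatter", {})),
        "headings": list(serialized.get("headings", [])),
        "sections": list(serialized.get("sections", [])),
    }


class _FakeDocument:
    # Hypothetical stand-in for a parsed third-party document.
    def to_dict(self) -> dict[str, Any]:
        return {"headings": [{"level": 1, "text": "Title"}]}


structure = to_plain_structure(_FakeDocument())
```
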
## Not Yet Implemented

- Asynchronous job runner and queue dispatch.
- Re-ingestion reconciliation for existing assets.
- Identity policies that preserve asset identity across source moves.
- PDF, office document, and dataset extractors.
- Deep normalized structure for tables, links, embedded references, and fields
  beyond extractor-provided metadata.
- Quarantine policy checks beyond unsupported/failed extraction paths.

## Test Coverage

`tests/test_asset_ingestion_service.py` covers:

- plain text local-file ingestion into governed assets,
- source and normalized representation creation,
- job persistence and status inspection,
- unsupported media failure without trusted asset creation,
- directory partial success/failure accounting,
- SQLite reload preserving ingestion jobs and ingested asset state.
@@ -22,15 +22,22 @@ from .core import (
     AuditEvent,
     AuditOutcome,
     Classification,
+    ConnectorCapability,
     ContextEntity,
     ContextEntityType,
     CoreRelationship,
     DerivedArtifactLineage,
+    ExtractionResult,
+    ExtractorCapability,
     IdempotencyRecord,
     IdempotencyStatus,
+    IngestionFailure,
+    IngestionJob,
+    IngestionJobStatus,
     KnowledgeAsset,
     LifecycleState,
     MetadataRecord,
+    NormalizedDocument,
     OperationContext,
     PolicyDecision,
     PolicyEffect,
@@ -38,6 +45,7 @@ from .core import (
     RepresentationKind,
     Sensitivity,
     SourceReference,
+    SourcePayload,
     VersionChangeType,
 )
 from .errors import (
@@ -50,10 +58,23 @@ from .errors import (
     ValidationError,
 )
 from .ingestion import IngestionRequest, IngestionResult, IngestionService
-from .ports import AllowAllPolicyGateway, AssetRegistryRepository, PolicyGateway
+from .ports import (
+    AllowAllPolicyGateway,
+    AssetRegistryRepository,
+    DirectorySourceConnector,
+    FormatExtractor,
+    PolicyGateway,
+    SourceConnector,
+)
 from .query import QueryEngine, QueryResult
 from .relationships import RelationshipGraph
-from .services import AssetChangeResult, AssetRegistryService, RelationshipChangeResult
+from .services import (
+    AssetChangeResult,
+    AssetIngestionResult,
+    AssetIngestionService,
+    AssetRegistryService,
+    RelationshipChangeResult,
+)
 from .storage import InMemoryKnowledgeRepository
 from .workflows import (
     InputBundle,
@@ -76,6 +97,8 @@ __all__ = [
     "ActorType",
     "AssetRepresentation",
     "AssetChangeResult",
+    "AssetIngestionResult",
+    "AssetIngestionService",
     "AssetRegistryRepository",
     "AssetRegistryService",
     "AssetVersion",
@@ -83,6 +106,7 @@ __all__ = [
     "AuditOutcome",
     "AuthorizationError",
     "Classification",
+    "ConnectorCapability",
     "Collection",
     "ContextAssembler",
     "ContextEntity",
@@ -92,12 +116,19 @@ __all__ = [
     "CoreRelationship",
     "DerivedArtifactLineage",
     "Diagnostic",
+    "DirectorySourceConnector",
     "DuplicateResourceError",
+    "ExtractionResult",
+    "ExtractorCapability",
+    "FormatExtractor",
     "InMemoryAssetRegistryRepository",
     "InMemoryKnowledgeRepository",
     "IngestionRequest",
     "IngestionResult",
     "IngestionService",
+    "IngestionFailure",
+    "IngestionJob",
+    "IngestionJobStatus",
     "InputBundle",
     "IdempotencyRecord",
     "IdempotencyStatus",
@@ -105,6 +136,7 @@ __all__ = [
     "KontextualError",
     "LifecycleState",
     "MetadataRecord",
+    "NormalizedDocument",
     "NotFoundError",
     "OperationRun",
     "OperationStage",
@@ -124,6 +156,8 @@ __all__ = [
     "RunStatus",
     "Sensitivity",
     "SourceReference",
+    "SourceConnector",
+    "SourcePayload",
     "SQLiteAssetRegistryRepository",
     "ValidationError",
     "VersionChangeType",
@@ -0,0 +1,5 @@
"""Built-in baseline format extractors."""

from .text import PlainTextExtractor

__all__ = ["PlainTextExtractor"]
42
src/kontextual_engine/adapters/builtin_extractors/text.py
Normal file
@@ -0,0 +1,42 @@
"""Plain text normalization extractor."""

from __future__ import annotations

from kontextual_engine.core import ExtractionResult, ExtractorCapability, NormalizedDocument, SourcePayload


class PlainTextExtractor:
    name = "plain-text"
    media_types = ("text/plain",)

    def capabilities(self) -> ExtractorCapability:
        return ExtractorCapability(
            extractor_name=self.name,
            media_types=self.media_types,
            extraction_depth="text",
            produces_structure=False,
        )

    def supports(self, media_type: str) -> bool:
        return media_type in self.media_types or media_type.startswith("text/plain")

    def extract(self, payload: SourcePayload) -> ExtractionResult:
        text = payload.read_text()
        normalized = NormalizedDocument(
            title=payload.title,
            text=text,
            fields={"line_count": len(text.splitlines())},
            confidence=1.0,
            extractor_metadata={
                "extractor": self.name,
                "source_media_type": payload.media_type,
            },
        )
        return ExtractionResult(
            normalized=normalized,
            metadata={
                "extractor": self.name,
                "source_digest": payload.content_digest,
                "source_size_bytes": payload.size_bytes,
            },
        )
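Two small behaviours of `PlainTextExtractor` are easy to miss when reading the
diff: the prefix match in `supports` accepts parameterized media types, and
the `line_count` field relies on `str.splitlines()`. The standalone sketch
below copies those two expressions into free functions (`supports` and
`line_count` as module-level names are an illustration, not part of the
commit) so they can be tried without the engine installed.

```python
MEDIA_TYPES = ("text/plain",)


def supports(media_type: str) -> bool:
    # Same predicate as PlainTextExtractor.supports: exact match or "text/plain" prefix.
    return media_type in MEDIA_TYPES or media_type.startswith("text/plain")


def line_count(text: str) -> int:
    # Mirrors the "line_count" field; splitlines() does not count a trailing newline
    # as an extra empty line.
    return len(text.splitlines())


checks = {
    "exact": supports("text/plain"),
    "with_charset": supports("text/plain; charset=utf-8"),
    "markdown": supports("text/markdown"),
}
counts = (line_count("a\nb\n"), line_count("a\nb"), line_count(""))
```

The prefix test is deliberately loose; it would also accept any other string
that merely begins with `text/plain`.
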
5
src/kontextual_engine/adapters/local_files/__init__.py
Normal file
@@ -0,0 +1,5 @@
"""Local filesystem ingestion connector."""

from .connector import LocalFileConnector

__all__ = ["LocalFileConnector"]
77
src/kontextual_engine/adapters/local_files/connector.py
Normal file
@@ -0,0 +1,77 @@
"""Local file and directory source connector."""

from __future__ import annotations

import mimetypes
from pathlib import Path
from typing import Any

from kontextual_engine.core import ConnectorCapability, SourcePayload, SourceReference, content_digest
from kontextual_engine.errors import NotFoundError, ValidationError


class LocalFileConnector:
    name = "local_file"

    def capabilities(self) -> ConnectorCapability:
        return ConnectorCapability(
            connector_name=self.name,
            source_types=("file", "directory"),
            supports_directories=True,
            metadata={"uri_schemes": ["file", "path"]},
        )

    def fetch(self, source_uri: str) -> SourcePayload:
        path = Path(source_uri).expanduser()
        if not path.exists():
            raise NotFoundError("Local source file not found", details={"path": str(path)})
        if not path.is_file():
            raise ValidationError("Local source is not a file", details={"path": str(path)})

        content = path.read_bytes()
        media_type = _guess_media_type(path)
        source_ref = SourceReference(
            source_system=self.name,
            path=str(path),
            checksum=content_digest(content),
            connector_ref=f"{self.name}:{path.resolve()}",
            metadata=_file_metadata(path),
        )
        return SourcePayload(
            connector_name=self.name,
            source_uri=str(path),
            source_ref=source_ref,
            media_type=media_type,
            content=content,
            title=path.stem,
            metadata={"filename": path.name, **_file_metadata(path)},
        )

    def iter_files(self, source_uri: str, *, recursive: bool = True) -> list[str]:
        root = Path(source_uri).expanduser()
        if not root.exists():
            raise NotFoundError("Local source directory not found", details={"path": str(root)})
        if root.is_file():
            return [str(root)]
        if not root.is_dir():
            raise ValidationError("Local source is not a directory", details={"path": str(root)})
        pattern = "**/*" if recursive else "*"
        return sorted(str(path) for path in root.glob(pattern) if path.is_file())


def _guess_media_type(path: Path) -> str:
    suffix = path.suffix.lower()
    if suffix in {".md", ".markdown", ".mkd"}:
        return "text/markdown"
    if suffix in {".txt", ".text", ".log"}:
        return "text/plain"
    guessed, _ = mimetypes.guess_type(path.name)
    return guessed or "application/octet-stream"


def _file_metadata(path: Path) -> dict[str, Any]:
    stat = path.stat()
    return {
        "size_bytes": stat.st_size,
        "mtime_ns": stat.st_mtime_ns,
    }
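The connector's media type detection and directory listing can be exercised
without the engine. The sketch below copies the suffix-override logic of
`_guess_media_type` and the glob pattern of `iter_files` into standalone
functions and runs them against a temporary directory. The function names
`guess_media_type` and `list_files`, the file names, and the `.zzzunknown`
extension are illustrative assumptions.

```python
import mimetypes
import tempfile
from pathlib import Path


def guess_media_type(path: Path) -> str:
    # Explicit suffix overrides first, then the stdlib table, then a binary fallback.
    suffix = path.suffix.lower()
    if suffix in {".md", ".markdown", ".mkd"}:
        return "text/markdown"
    if suffix in {".txt", ".text", ".log"}:
        return "text/plain"
    guessed, _ = mimetypes.guess_type(path.name)
    return guessed or "application/octet-stream"


def list_files(root: Path, *, recursive: bool = True) -> list[str]:
    # "**/*" walks subdirectories; "*" stays at the top level. Directories are skipped.
    pattern = "**/*" if recursive else "*"
    return sorted(str(p) for p in root.glob(pattern) if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "notes.md").write_text("# Title\n", encoding="utf-8")
    (root / "sub").mkdir()
    (root / "sub" / "run.log").write_text("ok\n", encoding="utf-8")
    (root / "blob.zzzunknown").write_bytes(b"\x00")

    top_level = [Path(p).name for p in list_files(root, recursive=False)]
    everything = {Path(p).name: guess_media_type(Path(p)) for p in list_files(root)}
```

Files with an unrecognized suffix fall back to `application/octet-stream`,
which is the case the unsupported-media failure path in the note is meant to
catch.
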
@@ -0,0 +1,5 @@
"""markitect-tool ingestion adapter boundary."""

from .markdown import MarkitectMarkdownExtractor

__all__ = ["MarkitectMarkdownExtractor"]
86
src/kontextual_engine/adapters/markitect_tool/markdown.py
Normal file
@@ -0,0 +1,86 @@
"""Markdown normalization through markitect-tool."""

from __future__ import annotations

from pathlib import Path
from typing import Any

from kontextual_engine.core import ExtractionResult, ExtractorCapability, NormalizedDocument, SourcePayload
from kontextual_engine.errors import AdapterUnavailableError


class MarkitectMarkdownExtractor:
    """Adapter boundary to markitect-tool; Markdown syntax logic stays external."""

    name = "markitect-tool"
    media_types = ("text/markdown", "text/x-markdown")

    def capabilities(self) -> ExtractorCapability:
        return ExtractorCapability(
            extractor_name=self.name,
            media_types=self.media_types,
            extraction_depth="structure",
            produces_structure=True,
            optional_dependency="markitect-tool",
            metadata={"delegates_markdown_syntax": True},
        )

    def supports(self, media_type: str) -> bool:
        return media_type in self.media_types

    def extract(self, payload: SourcePayload) -> ExtractionResult:
        try:
            import markitect_tool as mkt
        except Exception as exc:  # pragma: no cover - depends on optional environment
            raise AdapterUnavailableError(
                "markitect-tool is required for markdown normalization",
                details={"adapter": self.name, "media_type": payload.media_type},
            ) from exc

        source_path = payload.source_ref.path
        text = payload.read_text()
        document = self._parse_document(mkt, text, source_path)
        serialized = document.to_dict() if hasattr(document, "to_dict") else {}
        snapshot = self._snapshot(mkt, source_path)
        structure = {
            "frontmatter": dict(serialized.get("frontmatter", {})),
            "headings": list(serialized.get("headings", [])),
            "sections": list(serialized.get("sections", [])),
        }
        normalized = NormalizedDocument(
            title=payload.title,
            text=text,
            structure=structure,
            fields={
                "frontmatter": dict(serialized.get("frontmatter", {})),
                "heading_count": len(structure["headings"]),
                "section_count": len(structure["sections"]),
            },
            confidence=1.0,
            extractor_metadata={
                "extractor": self.name,
                "source_media_type": payload.media_type,
                "snapshot": snapshot,
            },
        )
        return ExtractionResult(
            normalized=normalized,
            metadata={
                "extractor": self.name,
                "frontmatter": structure["frontmatter"],
                "headings": structure["headings"],
                "snapshot": snapshot,
                "source_digest": payload.content_digest,
                "source_size_bytes": payload.size_bytes,
            },
        )

    def _parse_document(self, mkt: Any, text: str, source_path: str | None) -> Any:
        if source_path and Path(source_path).exists() and hasattr(mkt, "parse_markdown_file"):
            return mkt.parse_markdown_file(Path(source_path))
        return mkt.parse_markdown(text, source_path=source_path)

    def _snapshot(self, mkt: Any, source_path: str | None) -> dict[str, Any]:
        if not source_path or not Path(source_path).exists() or not hasattr(mkt, "snapshot_identity_for_file"):
            return {}
        return mkt.snapshot_identity_for_file(Path(source_path), parse_options={"profile": "default"}).to_dict()
@@ -13,6 +13,8 @@ from kontextual_engine.core import (
     ContextEntity,
     CoreRelationship,
     IdempotencyRecord,
+    IngestionJob,
+    IngestionJobStatus,
     KnowledgeAsset,
     LifecycleState,
     MetadataRecord,
@@ -32,6 +34,7 @@ class InMemoryAssetRegistryRepository:
     versions: dict[str, list[AssetVersion]] = field(default_factory=dict)
     audit_events: dict[str, AuditEvent] = field(default_factory=dict)
     idempotency_records: dict[str, IdempotencyRecord] = field(default_factory=dict)
+    ingestion_jobs: dict[str, IngestionJob] = field(default_factory=dict)

     def save_actor(self, actor: Actor) -> Actor:
         self.actors[actor.id] = actor
@@ -190,3 +193,23 @@ class InMemoryAssetRegistryRepository:

     def get_idempotency_record(self, key: str) -> IdempotencyRecord | None:
         return self.idempotency_records.get(key)
+
+    def save_ingestion_job(self, job: IngestionJob) -> IngestionJob:
+        self.ingestion_jobs[job.job_id] = job
+        return job
+
+    def get_ingestion_job(self, job_id: str) -> IngestionJob:
+        try:
+            return self.ingestion_jobs[job_id]
+        except KeyError as exc:
+            raise NotFoundError("Ingestion job not found", details={"job_id": job_id}) from exc
+
+    def list_ingestion_jobs(
+        self,
+        *,
+        status: IngestionJobStatus | None = None,
+    ) -> list[IngestionJob]:
+        jobs: Iterable[IngestionJob] = self.ingestion_jobs.values()
+        if status is not None:
+            jobs = [job for job in jobs if job.status == status]
+        return sorted(jobs, key=lambda job: (job.created_at, job.job_id))
@@ -15,6 +15,8 @@ from kontextual_engine.core import (
     ContextEntity,
     CoreRelationship,
     IdempotencyRecord,
+    IngestionJob,
+    IngestionJobStatus,
     KnowledgeAsset,
     LifecycleState,
     MetadataRecord,
@@ -381,6 +383,51 @@ class SQLiteAssetRegistryRepository:
             return None
         return IdempotencyRecord.from_dict(_loads(row["payload"]))

+    def save_ingestion_job(self, job: IngestionJob) -> IngestionJob:
+        with self._connect() as conn:
+            conn.execute(
+                """
+                insert into ingestion_jobs (id, status, actor_id, correlation_id, created_at, updated_at, payload)
+                values (?, ?, ?, ?, ?, ?, ?)
+                on conflict(id) do update set
+                    status=excluded.status,
+                    actor_id=excluded.actor_id,
+                    correlation_id=excluded.correlation_id,
+                    updated_at=excluded.updated_at,
+                    payload=excluded.payload
+                """,
+                (
+                    job.job_id,
+                    job.status.value,
+                    job.actor_id,
+                    job.correlation_id,
+                    job.created_at,
+                    job.updated_at,
+                    _json(job.to_dict()),
+                ),
+            )
+        return job
+
+    def get_ingestion_job(self, job_id: str) -> IngestionJob:
+        row = self._one("select payload from ingestion_jobs where id = ?", (job_id,))
+        if row is None:
+            raise NotFoundError("Ingestion job not found", details={"job_id": job_id})
+        return IngestionJob.from_dict(_loads(row["payload"]))
+
+    def list_ingestion_jobs(
+        self,
+        *,
+        status: IngestionJobStatus | None = None,
+    ) -> list[IngestionJob]:
+        if status is None:
+            rows = self._all("select payload from ingestion_jobs order by created_at, id", ())
+        else:
+            rows = self._all(
+                "select payload from ingestion_jobs where status = ? order by created_at, id",
+                (status.value,),
+            )
+        return [IngestionJob.from_dict(_loads(row["payload"])) for row in rows]
+
     def _initialize(self) -> None:
         with self._connect() as conn:
             conn.executescript(
@@ -449,6 +496,15 @@ class SQLiteAssetRegistryRepository:
                     status text not null,
                     payload text not null
                 );
+                create table if not exists ingestion_jobs (
+                    id text primary key,
+                    status text not null,
+                    actor_id text not null,
+                    correlation_id text not null,
+                    created_at text not null,
+                    updated_at text not null,
+                    payload text not null
+                );
                 create index if not exists idx_assets_lifecycle on assets(lifecycle);
                 create index if not exists idx_representations_asset on representations(asset_id);
                 create index if not exists idx_metadata_asset on metadata_records(asset_id);
@@ -458,6 +514,8 @@ class SQLiteAssetRegistryRepository:
                 create index if not exists idx_versions_asset on asset_versions(asset_id);
                 create index if not exists idx_audit_target on audit_events(target);
                 create index if not exists idx_audit_correlation on audit_events(correlation_id);
+                create index if not exists idx_ingestion_jobs_status on ingestion_jobs(status);
+                create index if not exists idx_ingestion_jobs_correlation on ingestion_jobs(correlation_id);
                 """
             )

@@ -4,6 +4,16 @@ from .actors import Actor, ActorType, OperationContext
 from .assets import AssetRepresentation, KnowledgeAsset, RepresentationKind
 from .audit import AuditEvent, AuditOutcome
 from .idempotency import IdempotencyRecord, IdempotencyStatus
+from .ingestion import (
+    ConnectorCapability,
+    ExtractionResult,
+    ExtractorCapability,
+    IngestionFailure,
+    IngestionJob,
+    IngestionJobStatus,
+    NormalizedDocument,
+    SourcePayload,
+)
 from .metadata import Classification, LifecycleState, MetadataRecord, Sensitivity
 from .policy import PolicyDecision, PolicyEffect
 from .primitives import content_digest, mapping_digest, new_id, stable_json_dumps, utc_now
@@ -28,15 +38,22 @@ __all__ = [
     "AuditEvent",
     "AuditOutcome",
     "Classification",
+    "ConnectorCapability",
     "ContextEntity",
     "ContextEntityType",
     "CoreRelationship",
     "DerivedArtifactLineage",
+    "ExtractionResult",
+    "ExtractorCapability",
     "IdempotencyRecord",
     "IdempotencyStatus",
+    "IngestionFailure",
+    "IngestionJob",
+    "IngestionJobStatus",
     "KnowledgeAsset",
     "LifecycleState",
     "MetadataRecord",
+    "NormalizedDocument",
     "OperationContext",
     "PolicyDecision",
     "PolicyEffect",
@@ -44,6 +61,7 @@ __all__ = [
     "RepresentationKind",
     "Sensitivity",
     "SourceReference",
+    "SourcePayload",
     "VersionChangeType",
     "content_digest",
     "mapping_digest",
308
src/kontextual_engine/core/ingestion.py
Normal file
@@ -0,0 +1,308 @@
|
|||||||
|
"""Ingestion job and normalized content primitives."""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
from dataclasses import dataclass, field, replace
|
||||||
|
from enum import Enum
|
||||||
|
from typing import Any
|
||||||
|
|
||||||
|
from .primitives import compact_dict, content_digest, mapping_digest, new_id, stable_json_dumps, utc_now
|
||||||
|
from .provenance import SourceReference
|
||||||
|
|
||||||
|
|
||||||
|
class IngestionJobStatus(str, Enum):
|
||||||
|
QUEUED = "queued"
|
||||||
|
RUNNING = "running"
|
||||||
|
COMPLETED = "completed"
|
||||||
|
FAILED = "failed"
|
||||||
|
PARTIALLY_COMPLETED = "partially_completed"
|
||||||
|
RETRIED = "retried"
|
||||||
|
QUARANTINED = "quarantined"
|
||||||
|
CANCELED = "canceled"
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass(frozen=True)
|
||||||
|
class IngestionFailure:
|
||||||
|
code: str
|
||||||
|
message: str
|
||||||
|
retriable: bool = False
|
||||||
|
details: dict[str, Any] = field(default_factory=dict)
|
||||||
|
|
||||||
|
def to_dict(self) -> dict[str, Any]:
|
||||||
|
return compact_dict(
|
||||||
|
{
|
||||||
|
"code": self.code,
|
||||||
|
"message": self.message,
|
||||||
|
"retriable": self.retriable,
|
||||||
|
"details": dict(self.details),
|
||||||
|
}
|
||||||
|
)
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def from_dict(cls, data: dict[str, Any]) -> "IngestionFailure":
|
||||||
|
return cls(
|
||||||
|
code=data["code"],
|
||||||
|
message=data["message"],
|
||||||
|
retriable=bool(data.get("retriable", False)),
|
||||||
|
details=dict(data.get("details", {})),
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass(frozen=True)
|
||||||
|
class ConnectorCapability:
|
||||||
|
connector_name: str
|
||||||
|
source_types: tuple[str, ...]
|
||||||
|
supports_directories: bool = False
|
||||||
|
metadata: dict[str, Any] = field(default_factory=dict)
|
||||||
|
|
||||||
|
def to_dict(self) -> dict[str, Any]:
|
||||||
|
return compact_dict(
|
||||||
|
{
|
||||||
|
"connector_name": self.connector_name,
|
||||||
|
"source_types": list(self.source_types),
|
||||||
|
"supports_directories": self.supports_directories,
|
||||||
|
"metadata": dict(self.metadata),
|
||||||
|
}
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass(frozen=True)
|
||||||
|
class ExtractorCapability:
|
||||||
|
extractor_name: str
|
||||||
|
media_types: tuple[str, ...]
|
||||||
|
extraction_depth: str = "text"
|
||||||
|
produces_structure: bool = False
|
||||||
|
optional_dependency: str | None = None
|
||||||
|
metadata: dict[str, Any] = field(default_factory=dict)
|
||||||
|
|
||||||
|
def to_dict(self) -> dict[str, Any]:
|
||||||
|
return compact_dict(
|
||||||
|
{
|
||||||
|
"extractor_name": self.extractor_name,
|
||||||
|
"media_types": list(self.media_types),
|
||||||
|
"extraction_depth": self.extraction_depth,
|
||||||
|
"produces_structure": self.produces_structure,
|
||||||
|
"optional_dependency": self.optional_dependency,
|
||||||
|
"metadata": dict(self.metadata),
|
||||||
|
}
|
||||||
|
)
|
||||||
|
|
||||||
|
|
+@dataclass(frozen=True)
+class SourcePayload:
+    connector_name: str
+    source_uri: str
+    source_ref: SourceReference
+    media_type: str
+    content: bytes
+    title: str
+    metadata: dict[str, Any] = field(default_factory=dict)
+    permission_context: dict[str, Any] = field(default_factory=dict)
+
+    @property
+    def content_digest(self) -> str:
+        return content_digest(self.content)
+
+    @property
+    def size_bytes(self) -> int:
+        return len(self.content)
+
+    def read_text(self, encoding: str = "utf-8") -> str:
+        return self.content.decode(encoding)
+
+
+@dataclass(frozen=True)
+class NormalizedDocument:
+    text: str
+    media_type: str = "application/vnd.kontextual.normalized+json"
+    title: str | None = None
+    structure: dict[str, Any] = field(default_factory=dict)
+    tables: list[dict[str, Any]] = field(default_factory=list)
+    links: list[dict[str, Any]] = field(default_factory=list)
+    fields: dict[str, Any] = field(default_factory=dict)
+    confidence: float | None = None
+    unsupported_elements: list[dict[str, Any]] = field(default_factory=list)
+    extractor_metadata: dict[str, Any] = field(default_factory=dict)
+
+    @property
+    def normalized_hash(self) -> str:
+        return mapping_digest(self.to_dict(include_hash=False))
+
+    def to_dict(self, *, include_hash: bool = True) -> dict[str, Any]:
+        data = compact_dict(
+            {
+                "title": self.title,
+                "text": self.text,
+                "media_type": self.media_type,
+                "structure": dict(self.structure),
+                "tables": list(self.tables),
+                "links": list(self.links),
+                "fields": dict(self.fields),
+                "confidence": self.confidence,
+                "unsupported_elements": list(self.unsupported_elements),
+                "extractor_metadata": dict(self.extractor_metadata),
+            }
+        )
+        if include_hash:
+            data["normalized_hash"] = self.normalized_hash
+        return data
+
+    def to_json(self) -> str:
+        return stable_json_dumps(self.to_dict())
+
+
+@dataclass(frozen=True)
+class ExtractionResult:
+    normalized: NormalizedDocument
+    metadata: dict[str, Any] = field(default_factory=dict)
+    diagnostics: tuple[IngestionFailure, ...] = ()
+
+    def to_dict(self) -> dict[str, Any]:
+        return compact_dict(
+            {
+                "normalized": self.normalized.to_dict(),
+                "metadata": dict(self.metadata),
+                "diagnostics": [diagnostic.to_dict() for diagnostic in self.diagnostics],
+            }
+        )
+
+
+@dataclass(frozen=True)
+class IngestionJob:
+    input: dict[str, Any]
+    actor_id: str
+    correlation_id: str
+    status: IngestionJobStatus = IngestionJobStatus.QUEUED
+    source_ref: SourceReference | None = None
+    output_asset_ids: tuple[str, ...] = ()
+    failures: tuple[IngestionFailure, ...] = ()
+    partial_results: dict[str, Any] = field(default_factory=dict)
+    retry_options: dict[str, Any] = field(default_factory=dict)
+    retry_of_job_id: str | None = None
+    attempts: int = 1
+    metadata: dict[str, Any] = field(default_factory=dict)
+    job_id: str = field(default_factory=lambda: new_id("ingest"))
+    created_at: str = field(default_factory=lambda: utc_now().isoformat())
+    updated_at: str = field(default_factory=lambda: utc_now().isoformat())
+    completed_at: str | None = None
+
+    @classmethod
+    def create(
+        cls,
+        *,
+        input: dict[str, Any],
+        actor_id: str,
+        correlation_id: str,
+        retry_of_job_id: str | None = None,
+        metadata: dict[str, Any] | None = None,
+    ) -> "IngestionJob":
+        return cls(
+            input=dict(input),
+            actor_id=actor_id,
+            correlation_id=correlation_id,
+            retry_of_job_id=retry_of_job_id,
+            metadata=dict(metadata or {}),
+        )
+
+    def running(self, *, source_ref: SourceReference | None = None) -> "IngestionJob":
+        return replace(
+            self,
+            status=IngestionJobStatus.RUNNING,
+            source_ref=source_ref or self.source_ref,
+            updated_at=utc_now().isoformat(),
+        )
+
+    def completed(
+        self,
+        *,
+        output_asset_ids: tuple[str, ...],
+        partial_results: dict[str, Any] | None = None,
+    ) -> "IngestionJob":
+        now = utc_now().isoformat()
+        return replace(
+            self,
+            status=IngestionJobStatus.COMPLETED,
+            output_asset_ids=tuple(output_asset_ids),
+            partial_results=dict(partial_results or self.partial_results),
+            updated_at=now,
+            completed_at=now,
+        )
+
+    def failed(
+        self,
+        failure: IngestionFailure,
+        *,
+        status: IngestionJobStatus = IngestionJobStatus.FAILED,
+        partial_results: dict[str, Any] | None = None,
+    ) -> "IngestionJob":
+        now = utc_now().isoformat()
+        return replace(
+            self,
+            status=status,
+            failures=self.failures + (failure,),
+            partial_results=dict(partial_results or self.partial_results),
+            updated_at=now,
+            completed_at=now,
+        )
+
+    def partially_completed(
+        self,
+        *,
+        output_asset_ids: tuple[str, ...],
+        failures: tuple[IngestionFailure, ...],
+        partial_results: dict[str, Any],
+    ) -> "IngestionJob":
+        now = utc_now().isoformat()
+        return replace(
+            self,
+            status=IngestionJobStatus.PARTIALLY_COMPLETED,
+            output_asset_ids=tuple(output_asset_ids),
+            failures=tuple(failures),
+            partial_results=dict(partial_results),
+            updated_at=now,
+            completed_at=now,
+        )
+
+    def to_dict(self) -> dict[str, Any]:
+        return compact_dict(
+            {
+                "job_id": self.job_id,
+                "status": self.status.value,
+                "input": dict(self.input),
+                "actor_id": self.actor_id,
+                "correlation_id": self.correlation_id,
+                "source_ref": self.source_ref.to_dict() if self.source_ref else None,
+                "output_asset_ids": list(self.output_asset_ids),
+                "failures": [failure.to_dict() for failure in self.failures],
+                "partial_results": dict(self.partial_results),
+                "retry_options": dict(self.retry_options),
+                "retry_of_job_id": self.retry_of_job_id,
+                "attempts": self.attempts,
+                "metadata": dict(self.metadata),
+                "created_at": self.created_at,
+                "updated_at": self.updated_at,
+                "completed_at": self.completed_at,
+            }
+        )
+
+    @classmethod
+    def from_dict(cls, data: dict[str, Any]) -> "IngestionJob":
+        source_ref = data.get("source_ref")
+        return cls(
+            job_id=data["job_id"],
+            status=IngestionJobStatus(data["status"]),
+            input=dict(data.get("input", {})),
+            actor_id=data["actor_id"],
+            correlation_id=data["correlation_id"],
+            source_ref=SourceReference.from_dict(source_ref) if source_ref else None,
+            output_asset_ids=tuple(data.get("output_asset_ids", [])),
+            failures=tuple(IngestionFailure.from_dict(item) for item in data.get("failures", [])),
+            partial_results=dict(data.get("partial_results", {})),
+            retry_options=dict(data.get("retry_options", {})),
+            retry_of_job_id=data.get("retry_of_job_id"),
+            attempts=int(data.get("attempts", 1)),
+            metadata=dict(data.get("metadata", {})),
+            created_at=data["created_at"],
+            updated_at=data["updated_at"],
+            completed_at=data.get("completed_at"),
+        )
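
Note on the `IngestionJob` transitions: `running`, `completed`, `failed`, and `partially_completed` each return a new frozen instance through `replace` instead of mutating the job. The following is a minimal sketch of that pattern only. `MiniJob` and `MiniStatus` are hypothetical stand-ins invented for illustration, not the engine's classes, and the status strings are assumed.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class MiniStatus(str, Enum):  # hypothetical stand-in for IngestionJobStatus
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"


def _now() -> str:
    # ISO-8601 UTC timestamp, similar in spirit to utc_now().isoformat()
    return datetime.now(timezone.utc).isoformat()


@dataclass(frozen=True)
class MiniJob:  # hypothetical, far smaller than IngestionJob
    job_id: str
    status: MiniStatus = MiniStatus.QUEUED
    output_asset_ids: tuple = ()
    completed_at: Optional[str] = None

    def running(self) -> "MiniJob":
        # replace() copies the frozen instance with only the named fields changed.
        return replace(self, status=MiniStatus.RUNNING)

    def completed(self, output_asset_ids: tuple) -> "MiniJob":
        return replace(
            self,
            status=MiniStatus.COMPLETED,
            output_asset_ids=tuple(output_asset_ids),
            completed_at=_now(),
        )


queued = MiniJob(job_id="ingest-1")
done = queued.running().completed(("asset-1",))
```

Because every transition yields a fresh object, the caller decides which snapshot to persist. That is presumably why the service calls `save_ingestion_job` after each transition; `queued` above is left untouched by the later calls.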
@@ -1,11 +1,14 @@
 """Stable ports owned by the engine."""
 
+from .ingestion import DirectorySourceConnector, FormatExtractor, SourceConnector
 from .policy import AllowAllPolicyGateway, PolicyGateway
 from .repositories import AssetRegistryRepository
 
 __all__ = [
     "AllowAllPolicyGateway",
     "AssetRegistryRepository",
+    "DirectorySourceConnector",
+    "FormatExtractor",
     "PolicyGateway",
+    "SourceConnector",
 ]
34
src/kontextual_engine/ports/ingestion.py
Normal file
@@ -0,0 +1,34 @@
+"""Connector and extractor ports for ingestion."""
+
+from __future__ import annotations
+
+from typing import Protocol
+
+from kontextual_engine.core import (
+    ConnectorCapability,
+    ExtractionResult,
+    ExtractorCapability,
+    SourcePayload,
+)
+
+
+class SourceConnector(Protocol):
+    name: str
+
+    def capabilities(self) -> ConnectorCapability: ...
+
+    def fetch(self, source_uri: str) -> SourcePayload: ...
+
+
+class DirectorySourceConnector(SourceConnector, Protocol):
+    def iter_files(self, source_uri: str, *, recursive: bool = True) -> list[str]: ...
+
+
+class FormatExtractor(Protocol):
+    name: str
+
+    def capabilities(self) -> ExtractorCapability: ...
+
+    def supports(self, media_type: str) -> bool: ...
+
+    def extract(self, payload: SourcePayload) -> ExtractionResult: ...
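
Note on the ports: `SourceConnector` and `FormatExtractor` are `typing.Protocol` classes, so an adapter satisfies them by shape and does not have to inherit from them. The sketch below illustrates that idea with hypothetical stand-ins (`MiniPayload`, `MiniExtractor`, `UpperCaseTextExtractor`, `pick_extractor`). None of these names exist in the commit, and the real `extract` returns an `ExtractionResult` rather than a string.

```python
from dataclasses import dataclass
from typing import List, Protocol, runtime_checkable


@dataclass(frozen=True)
class MiniPayload:  # hypothetical stand-in for SourcePayload
    media_type: str
    content: bytes


@runtime_checkable
class MiniExtractor(Protocol):  # hypothetical, mirrors the FormatExtractor shape
    name: str

    def supports(self, media_type: str) -> bool: ...

    def extract(self, payload: MiniPayload) -> str: ...


class UpperCaseTextExtractor:
    """Matches MiniExtractor structurally; note there is no base class."""

    name = "upper-text"

    def supports(self, media_type: str) -> bool:
        return media_type == "text/plain"

    def extract(self, payload: MiniPayload) -> str:
        return payload.content.decode("utf-8").upper()


def pick_extractor(extractors: List[MiniExtractor], media_type: str) -> MiniExtractor:
    # First extractor whose supports() returns True wins, so order matters.
    for extractor in extractors:
        if extractor.supports(media_type):
            return extractor
    raise LookupError(f"no extractor for {media_type}")


chosen = pick_extractor([UpperCaseTextExtractor()], "text/plain")
text = chosen.extract(MiniPayload("text/plain", b"hello"))
```

The `@runtime_checkable` decorator is an addition made here so that `isinstance` checks work in the sketch; the ports in the commit do not use it.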
@@ -12,6 +12,8 @@ from kontextual_engine.core import (
     ContextEntity,
     CoreRelationship,
     IdempotencyRecord,
+    IngestionJob,
+    IngestionJobStatus,
     KnowledgeAsset,
     LifecycleState,
     MetadataRecord,
@@ -71,3 +73,11 @@ class AssetRegistryRepository(Protocol):
 
     def save_idempotency_record(self, record: IdempotencyRecord) -> IdempotencyRecord: ...
     def get_idempotency_record(self, key: str) -> IdempotencyRecord | None: ...
+
+    def save_ingestion_job(self, job: IngestionJob) -> IngestionJob: ...
+    def get_ingestion_job(self, job_id: str) -> IngestionJob: ...
+    def list_ingestion_jobs(
+        self,
+        *,
+        status: IngestionJobStatus | None = None,
+    ) -> list[IngestionJob]: ...
@@ -1,5 +1,12 @@
 """Application services for the engine."""
 
 from .asset_service import AssetChangeResult, AssetRegistryService, RelationshipChangeResult
+from .ingestion_service import AssetIngestionResult, AssetIngestionService
 
-__all__ = ["AssetChangeResult", "AssetRegistryService", "RelationshipChangeResult"]
+__all__ = [
+    "AssetChangeResult",
+    "AssetIngestionResult",
+    "AssetIngestionService",
+    "AssetRegistryService",
+    "RelationshipChangeResult",
+]
304
src/kontextual_engine/services/ingestion_service.py
Normal file
@@ -0,0 +1,304 @@
+"""Application service for governed asset ingestion."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Iterable
+
+from kontextual_engine.adapters.builtin_extractors import PlainTextExtractor
+from kontextual_engine.adapters.local_files import LocalFileConnector
+from kontextual_engine.adapters.markitect_tool import MarkitectMarkdownExtractor
+from kontextual_engine.core import (
+    AssetRepresentation,
+    Classification,
+    IngestionFailure,
+    IngestionJob,
+    IngestionJobStatus,
+    KnowledgeAsset,
+    MetadataRecord,
+    OperationContext,
+    RepresentationKind,
+    Sensitivity,
+    SourcePayload,
+    mapping_digest,
+)
+from kontextual_engine.errors import AdapterUnavailableError, KontextualError
+from kontextual_engine.ports import AssetRegistryRepository, DirectorySourceConnector, FormatExtractor, SourceConnector
+
+from .asset_service import AssetChangeResult, AssetRegistryService
+
+
+@dataclass(frozen=True)
+class AssetIngestionResult:
+    job: IngestionJob
+    asset: KnowledgeAsset | None = None
+    asset_change: AssetChangeResult | None = None
+
+
+class AssetIngestionService:
+    def __init__(
+        self,
+        repository: AssetRegistryRepository,
+        *,
+        asset_service: AssetRegistryService | None = None,
+        connectors: Iterable[SourceConnector] | None = None,
+        extractors: Iterable[FormatExtractor] | None = None,
+    ) -> None:
+        self.repository = repository
+        self.asset_service = asset_service or AssetRegistryService(repository)
+        self.connectors = {connector.name: connector for connector in (connectors or [LocalFileConnector()])}
+        self.extractors = list(extractors or [PlainTextExtractor(), MarkitectMarkdownExtractor()])
+
+    def connector_capabilities(self) -> list[dict]:
+        return [connector.capabilities().to_dict() for connector in self.connectors.values()]
+
+    def extractor_capabilities(self) -> list[dict]:
+        return [extractor.capabilities().to_dict() for extractor in self.extractors]
+
+    def ingest_file(
+        self,
+        path: str | Path,
+        context: OperationContext,
+        *,
+        asset_id: str | None = None,
+        title: str | None = None,
+        classification: Classification | None = None,
+        idempotency_key: str | None = None,
+    ) -> AssetIngestionResult:
+        connector = self._connector("local_file")
+        job = IngestionJob.create(
+            input={"connector": connector.name, "source_uri": str(path), "mode": "file"},
+            actor_id=context.actor.id,
+            correlation_id=context.correlation_id,
+        )
+        self.repository.save_ingestion_job(job)
+        try:
+            payload = connector.fetch(str(path))
+            return self._ingest_payload(
+                job,
+                payload,
+                context,
+                asset_id=asset_id,
+                title=title,
+                classification=classification,
+                idempotency_key=idempotency_key,
+            )
+        except Exception as exc:
+            failed = job.failed(_failure_from_exception(exc))
+            self.repository.save_ingestion_job(failed)
+            return AssetIngestionResult(failed)
+
+    def ingest_directory(
+        self,
+        path: str | Path,
+        context: OperationContext,
+        *,
+        recursive: bool = True,
+        classification: Classification | None = None,
+    ) -> IngestionJob:
+        connector = self._directory_connector("local_file")
+        job = IngestionJob.create(
+            input={
+                "connector": connector.name,
+                "source_uri": str(path),
+                "mode": "directory",
+                "recursive": recursive,
+            },
+            actor_id=context.actor.id,
+            correlation_id=context.correlation_id,
+        )
+        job = job.running()
+        self.repository.save_ingestion_job(job)
+
+        output_asset_ids: list[str] = []
+        failures: list[IngestionFailure] = []
+        item_results: list[dict] = []
+        files = connector.iter_files(str(path), recursive=recursive)
+        for source_uri in files:
+            result = self.ingest_file(source_uri, context, classification=classification)
+            item = {
+                "source_uri": source_uri,
+                "job_id": result.job.job_id,
+                "status": result.job.status.value,
+            }
+            if result.asset is not None:
+                output_asset_ids.append(result.asset.id)
+                item["asset_id"] = result.asset.id
+            if result.job.failures:
+                failures.extend(result.job.failures)
+                item["failures"] = [failure.to_dict() for failure in result.job.failures]
+            item_results.append(item)
+
+        partial_results = {
+            "files_total": len(files),
+            "succeeded": sum(1 for item in item_results if item["status"] == IngestionJobStatus.COMPLETED.value),
+            "failed": sum(1 for item in item_results if item["status"] == IngestionJobStatus.FAILED.value),
+            "quarantined": sum(1 for item in item_results if item["status"] == IngestionJobStatus.QUARANTINED.value),
+            "skipped": 0,
+            "items": item_results,
+        }
+        if failures and output_asset_ids:
+            job = job.partially_completed(
+                output_asset_ids=tuple(output_asset_ids),
+                failures=tuple(failures),
+                partial_results=partial_results,
+            )
+        elif failures:
+            job = job.failed(
+                IngestionFailure(
+                    code="ingestion.directory_failed",
+                    message="Directory ingestion failed for all files",
+                    retriable=True,
+                    details=partial_results,
+                ),
+                partial_results=partial_results,
+            )
+        else:
+            job = job.completed(output_asset_ids=tuple(output_asset_ids), partial_results=partial_results)
+        self.repository.save_ingestion_job(job)
+        return job
+
+    def get_job(self, job_id: str) -> IngestionJob:
+        return self.repository.get_ingestion_job(job_id)
+
+    def list_jobs(self, *, status: IngestionJobStatus | None = None) -> list[IngestionJob]:
+        return self.repository.list_ingestion_jobs(status=status)
+
+    def _ingest_payload(
+        self,
+        job: IngestionJob,
+        payload: SourcePayload,
+        context: OperationContext,
+        *,
+        asset_id: str | None,
+        title: str | None,
+        classification: Classification | None,
+        idempotency_key: str | None,
+    ) -> AssetIngestionResult:
+        job = job.running(source_ref=payload.source_ref)
+        self.repository.save_ingestion_job(job)
+        extractor = self._extractor(payload.media_type)
+        extraction = extractor.extract(payload)
+        resolved_asset_id = asset_id or _stable_asset_id(payload)
+        source_representation = AssetRepresentation.from_content(
+            resolved_asset_id,
+            RepresentationKind.SOURCE,
+            payload.media_type,
+            payload.content,
+            storage_ref=payload.source_uri,
+            producer=payload.connector_name,
+            source_ref_id=payload.source_ref.id,
+            metadata={
+                "connector": payload.connector_name,
+                "source_digest": payload.content_digest,
+                "source_size_bytes": payload.size_bytes,
+                **payload.metadata,
+            },
+        )
+        normalized_representation = AssetRepresentation.from_content(
+            resolved_asset_id,
+            RepresentationKind.NORMALIZED,
+            extraction.normalized.media_type,
+            extraction.normalized.to_json(),
+            producer=extractor.name,
+            source_ref_id=payload.source_ref.id,
+            metadata={
+                "extractor": extractor.name,
+                "normalized_hash": extraction.normalized.normalized_hash,
+                **extraction.metadata,
+            },
+        )
+        asset_change = self.asset_service.create_asset(
+            title or payload.title,
+            classification or Classification(asset_type="document", sensitivity=Sensitivity.INTERNAL),
+            context,
+            asset_id=resolved_asset_id,
+            source_refs=[payload.source_ref],
+            representations=[source_representation, normalized_representation],
+            metadata_records=_metadata_records(payload, extractor.name, extraction.metadata),
+            idempotency_key=idempotency_key,
+        )
+        completed = job.completed(
+            output_asset_ids=(asset_change.asset.id,),
+            partial_results={
+                "connector": payload.connector_name,
+                "extractor": extractor.name,
+                "source_digest": payload.content_digest,
+                "representations": [
+                    source_representation.representation_id,
+                    normalized_representation.representation_id,
+                ],
+                "diagnostics": [diagnostic.to_dict() for diagnostic in extraction.diagnostics],
+            },
+        )
+        self.repository.save_ingestion_job(completed)
+        return AssetIngestionResult(completed, asset_change.asset, asset_change)
+
+    def _connector(self, name: str) -> SourceConnector:
+        try:
+            return self.connectors[name]
+        except KeyError as exc:
+            raise AdapterUnavailableError("Source connector is not registered", details={"connector": name}) from exc
+
+    def _directory_connector(self, name: str) -> DirectorySourceConnector:
+        connector = self._connector(name)
+        if not hasattr(connector, "iter_files"):
+            raise AdapterUnavailableError(
+                "Source connector does not support directory iteration",
+                details={"connector": name},
+            )
+        return connector  # type: ignore[return-value]
+
+    def _extractor(self, media_type: str) -> FormatExtractor:
+        for extractor in self.extractors:
+            if extractor.supports(media_type):
+                return extractor
+        raise AdapterUnavailableError(
+            "No extractor registered for media type",
+            details={"media_type": media_type},
+        )
+
+
+def _stable_asset_id(payload: SourcePayload) -> str:
+    digest = mapping_digest(
+        {
+            "source_system": payload.source_ref.source_system,
+            "path": payload.source_ref.path,
+            "uri": payload.source_ref.uri,
+            "external_id": payload.source_ref.external_id,
+            "connector_ref": payload.source_ref.connector_ref,
+        }
+    )
+    return f"asset-{digest.removeprefix('sha256:')[:20]}"
+
+
+def _metadata_records(
+    payload: SourcePayload,
+    extractor_name: str,
+    extraction_metadata: dict,
+) -> list[MetadataRecord]:
+    return [
+        MetadataRecord("source_media_type", payload.media_type, provenance={"producer": payload.connector_name}),
+        MetadataRecord("source_digest", payload.content_digest, provenance={"producer": payload.connector_name}),
+        MetadataRecord("source_size_bytes", payload.size_bytes, provenance={"producer": payload.connector_name}),
+        MetadataRecord("connector", payload.connector_name, provenance={"producer": payload.connector_name}, confirmed=True),
+        MetadataRecord("extractor", extractor_name, provenance={"producer": extractor_name}, confirmed=True),
+        MetadataRecord("extraction", dict(extraction_metadata), provenance={"producer": extractor_name}),
+    ]
+
+
+def _failure_from_exception(exc: Exception) -> IngestionFailure:
+    if isinstance(exc, KontextualError):
+        return IngestionFailure(
+            code=exc.code,
+            message=str(exc),
+            retriable=isinstance(exc, AdapterUnavailableError),
+            details=dict(exc.details),
+        )
+    return IngestionFailure(
+        code="ingestion.unexpected",
+        message=str(exc),
+        retriable=False,
+        details={"exception_type": type(exc).__name__},
+    )
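
Note on `_stable_asset_id`: when no `asset_id` is passed, the ID is derived from source-location fields of the `SourceReference` (system, path, URI, external ID, connector ref), not from the file content. The body of `mapping_digest` is not part of this diff, so the sketch below assumes it hashes canonical JSON with SHA-256 and prefixes the result with `sha256:`. Both helper names are hypothetical, and only two of the five fields are used, to keep it short.

```python
import hashlib
import json


def mapping_digest_sketch(mapping: dict) -> str:
    # Assumption: canonical JSON (sorted keys, compact separators) hashed with SHA-256.
    canonical = json.dumps(mapping, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def stable_asset_id_sketch(source_system: str, path: str) -> str:
    # Same truncation as the diff: drop the scheme prefix, keep 20 hex characters.
    digest = mapping_digest_sketch({"source_system": source_system, "path": path})
    return f"asset-{digest.removeprefix('sha256:')[:20]}"


first = stable_asset_id_sketch("local_file", "/data/note.txt")
again = stable_asset_id_sketch("local_file", "/data/note.txt")
other = stable_asset_id_sketch("local_file", "/data/other.txt")
```

Under these assumptions, re-ingesting an edited file at the same path resolves to the same asset ID, while a moved file gets a new one. How the service should reconcile the first case is listed as remaining work in the workplan (re-ingestion identity reconciliation).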
108
tests/test_asset_ingestion_service.py
Normal file
@@ -0,0 +1,108 @@
+from pathlib import Path
+
+from kontextual_engine import (
+    Actor,
+    ActorType,
+    AssetIngestionService,
+    Classification,
+    IngestionJobStatus,
+    InMemoryAssetRegistryRepository,
+    LifecycleState,
+    OperationContext,
+    RepresentationKind,
+    Sensitivity,
+    SQLiteAssetRegistryRepository,
+)
+
+
+def test_asset_ingestion_service_ingests_plain_text_file_as_governed_asset(tmp_path: Path) -> None:
+    source = tmp_path / "note.txt"
+    source.write_text("hello\nworld\n", encoding="utf-8")
+    repo = InMemoryAssetRegistryRepository()
+    service = AssetIngestionService(repo)
+
+    result = service.ingest_file(
+        source,
+        operation_context(),
+        asset_id="asset-note",
+        classification=Classification(asset_type="note", sensitivity=Sensitivity.INTERNAL),
+    )
+
+    assert result.job.status == IngestionJobStatus.COMPLETED
+    assert result.job.correlation_id == "corr-ingest"
+    assert result.job.output_asset_ids == ("asset-note",)
+    assert result.asset is not None
+    assert result.asset.source_refs[0].source_system == "local_file"
+    assert result.asset.source_refs[0].path == str(source)
+    assert repo.get_ingestion_job(result.job.job_id).status == IngestionJobStatus.COMPLETED
+    assert {item.kind for item in repo.list_representations(asset_id="asset-note")} == {
+        RepresentationKind.SOURCE,
+        RepresentationKind.NORMALIZED,
+    }
+    normalized = repo.list_representations(asset_id="asset-note", kind=RepresentationKind.NORMALIZED)[0]
+    assert normalized.media_type == "application/vnd.kontextual.normalized+json"
+    assert normalized.metadata["extractor"] == "plain-text"
+    assert repo.list_audit_events(target="asset:asset-note")[0].operation == "asset.create"
+
+
+def test_ingestion_failure_records_job_without_trusting_unsupported_asset(tmp_path: Path) -> None:
+    source = tmp_path / "blob.bin"
+    source.write_bytes(b"\x00\x01\x02")
+    repo = InMemoryAssetRegistryRepository()
+    service = AssetIngestionService(repo)
+
+    result = service.ingest_file(source, operation_context(), asset_id="asset-blob")
+
+    assert result.asset is None
+    assert result.job.status == IngestionJobStatus.FAILED
+    assert result.job.failures[0].code == "kontextual.adapter_unavailable"
+    assert result.job.failures[0].details["media_type"] == "application/octet-stream"
+    assert repo.list_assets() == []
+
+
+def test_directory_ingestion_reports_partial_results(tmp_path: Path) -> None:
+    (tmp_path / "one.txt").write_text("one", encoding="utf-8")
+    (tmp_path / "two.bin").write_bytes(b"\x00\x01")
+    repo = InMemoryAssetRegistryRepository()
+    service = AssetIngestionService(repo)
+
+    job = service.ingest_directory(tmp_path, operation_context(), recursive=False)
+
+    assert job.status == IngestionJobStatus.PARTIALLY_COMPLETED
+    assert job.partial_results["files_total"] == 2
+    assert job.partial_results["succeeded"] == 1
+    assert job.partial_results["failed"] == 1
+    assert len(job.output_asset_ids) == 1
+    assert len(job.failures) == 1
+
+
+def test_sqlite_ingestion_jobs_survive_reinstantiation(tmp_path: Path) -> None:
+    source = tmp_path / "policy.txt"
+    source.write_text("governed ingestion", encoding="utf-8")
+    db_path = tmp_path / "registry.sqlite"
+    repo = SQLiteAssetRegistryRepository(db_path)
+    service = AssetIngestionService(repo)
+
+    result = service.ingest_file(
+        source,
+        operation_context(),
+        asset_id="asset-policy",
+    )
+
+    reloaded = SQLiteAssetRegistryRepository(db_path)
+    job = reloaded.get_ingestion_job(result.job.job_id)
+
+    assert job.status == IngestionJobStatus.COMPLETED
+    assert job.output_asset_ids == ("asset-policy",)
+    assert reloaded.get_asset("asset-policy").lifecycle == LifecycleState.ACTIVE
+    assert len(reloaded.list_representations(asset_id="asset-policy")) == 2
+
+
+def operation_context() -> OperationContext:
+    actor = Actor.create(
+        ActorType.HUMAN,
+        actor_id="user-ingest",
+        display_name="Ingestion Tester",
+        groups=["engineering"],
+    )
+    return OperationContext.create(actor, correlation_id="corr-ingest")
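
Note on the directory test: `test_directory_ingestion_reports_partial_results` exercises the roll-up rule at the end of `ingest_directory`, where the parent job's status depends on whether any file produced an asset and whether any file failed. A minimal sketch of that decision follows. The function name and the status strings are hypothetical; the enum's actual values are not shown in this diff.

```python
from typing import Sequence


def rollup_status_sketch(output_asset_ids: Sequence[str], failures: Sequence[str]) -> str:
    """Mirror the if/elif/else at the end of ingest_directory (status strings assumed)."""
    if failures and output_asset_ids:
        # Some files produced assets and some failed.
        return "partially_completed"
    if failures:
        # Every attempted file failed.
        return "failed"
    # No failures at all, which includes the empty-directory case.
    return "completed"


mixed = rollup_status_sketch(["asset-1"], ["adapter_unavailable"])
all_failed = rollup_status_sketch([], ["adapter_unavailable"])
empty_dir = rollup_status_sketch([], [])
```

One consequence worth keeping in mind when reading the tests: a directory with no files has no failures, so it rolls up to a completed job with zero output assets.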
@@ -4,13 +4,13 @@ type: workplan
 title: "Multi-Format Ingestion And Normalization"
 domain: markitect
 repo: kontextual-engine
-status: todo
+status: active
 owner: codex
 topic_slug: markitect
 planning_priority: high
 planning_order: 6
 created: "2026-05-05"
-updated: "2026-05-05"
+updated: "2026-05-06"
 state_hub_workstream_id: "270c83c0-eaed-4143-99d0-bb3fcfd23758"
 ---
 
@@ -45,11 +45,21 @@ needed, and snapshot identity. The engine should normalize Markitect results
 into its common representation and preserve source/adapter provenance rather
 than rebuilding Markdown syntax behavior.
 
+## Implementation Status
+
+As of 2026-05-06, the first ingestion slice is recorded in
+`docs/ingestion-implementation.md`. It establishes ingestion job primitives,
+connector/extractor ports, local file ingestion, plain text normalization,
+Markitect markdown adapter boundaries, directory partial-result reporting, and
+in-memory/SQLite job persistence. Remaining work is focused on async execution,
+re-ingestion identity reconciliation, richer structural extraction, quarantine
+policy checks, and non-text format adapters.
+
 ## I6.1 - Implement ingestion job model status and retry surface
 
 ```task
 id: KONT-WP-0006-T001
-status: todo
+status: done
 priority: high
 state_hub_task_id: "8e5e514a-6eef-42d9-a93c-2458b4c82753"
 ```
@@ -68,7 +78,7 @@ Acceptance:
 
 ```task
 id: KONT-WP-0006-T002
-status: todo
+status: done
 priority: high
 state_hub_task_id: "3eafdab5-478d-49d9-a17f-3cd7c8847cb1"
 ```
@@ -87,7 +97,7 @@ Acceptance:
 
 ```task
 id: KONT-WP-0006-T003
-status: todo
+status: in_progress
 priority: high
 state_hub_task_id: "d3e3d4d2-a581-4438-bee7-6fc4161d3925"
 ```
@@ -105,7 +115,7 @@ Acceptance:
 
 ```task
 id: KONT-WP-0006-T004
-status: todo
+status: in_progress
 priority: high
 state_hub_task_id: "63bf2f7e-705d-40ae-a160-75fc508ffb1f"
 ```