
Architecture Blueprint

Date: 2026-05-05

Status: planning baseline for the V0.2 knowledge operations roadmap.

This blueprint defines the target architecture for kontextual-engine as a headless knowledge operations engine. It should guide implementation workplans without freezing every internal detail too early.

Architectural Aim

kontextual-engine should make heterogeneous information assets durable, contextual, governed, retrievable, transformable, and agent-operable through a clean backend architecture.

The architecture should optimize for:

  • stable knowledge asset identity,
  • explicit source provenance,
  • source, normalized, and derived representations,
  • governed retrieval and transformation,
  • traceable workflow and job execution,
  • auditability and structured errors,
  • provider-neutral adapters,
  • agent-safe operations,
  • exportability and operational visibility.

Non-Negotiable Rules

  1. Core behavior lives in domain and application services, not in HTTP routes.
  2. All material operations accept explicit actor and operation context.
  3. Permission and policy checks happen before content exposure or mutation.
  4. Source content, normalized representations, and derived artifacts remain distinct.
  5. Derived artifacts preserve lineage to sources, versions, parameters, actor, policy context, and operation run.
  6. Audit records are emitted for material operations by default.
  7. External systems are reached through ports and adapters.
  8. Agent operations are explicit catalog entries, not unrestricted internal method access.
  9. The service API wraps stable contracts; it does not define the domain model.
  10. The MVP can use local-first backends, but contracts must not assume one storage, search, workflow, AI, or policy provider.

System Shape

clients, apps, workflows, agents
  -> service API / SDK / operation catalog
  -> application services
  -> domain core
  -> ports
  -> adapters and infrastructure

The dependency direction should point inward:

api adapters -> application services -> domain core <- repository/search/workflow ports
                                           ^
                                           |
                                  infrastructure adapters

No domain model should import FastAPI, SQLite, HTTP clients, LLM providers, document parsing libraries, or source-system SDKs directly.
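The inward dependency rule can be illustrated with a minimal sketch. The names `AssetRepository`, `InMemoryAssetRepository`, and `register_asset`, and the two fields on `KnowledgeAsset`, are assumptions made for illustration and are not the engine's actual contracts.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class KnowledgeAsset:
    """Domain entity: plain data with no infrastructure imports."""
    asset_id: str
    title: str


class AssetRepository(Protocol):
    """Port owned by the engine (hypothetical shape); adapters implement it."""

    def save(self, asset: KnowledgeAsset) -> None: ...

    def get(self, asset_id: str) -> Optional[KnowledgeAsset]: ...


class InMemoryAssetRepository:
    """Adapter: deterministic in-memory implementation, suitable for tests."""

    def __init__(self) -> None:
        self._assets: dict = {}

    def save(self, asset: KnowledgeAsset) -> None:
        self._assets[asset.asset_id] = asset

    def get(self, asset_id: str) -> Optional[KnowledgeAsset]:
        return self._assets.get(asset_id)


def register_asset(repo: AssetRepository, asset_id: str, title: str) -> KnowledgeAsset:
    """Application-level function that depends only on the port, never on an adapter."""
    asset = KnowledgeAsset(asset_id=asset_id, title=title)
    repo.save(asset)
    return asset
```

Because `register_asset` is typed against the port, a SQLite-backed adapter could replace the in-memory one without touching the domain or service code.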

Package Layout

The current flat package can evolve incrementally into this shape:

src/kontextual_engine/
  core/
    assets.py              stable asset identity and representations
    metadata.py            metadata, classification, lifecycle, schemas
    relationships.py       typed relationships and contextual entities
    provenance.py          source refs, lineage, versions, changes
    actors.py              actors, delegated actors, operation context
    policy.py              policy inputs, decisions, review requirements
    audit.py               audit events and correlation IDs
    errors.py              structured errors and diagnostics
  services/
    asset_service.py       create/update/retire assets
    ingestion_service.py   submit and complete ingestion jobs
    retrieval_service.py   search, filter, snippets, context retrieval
    transform_service.py   transformation runs and derived artifacts
    workflow_service.py    templates, runs, steps, retries, exceptions
    agent_service.py       bounded agent operation catalog
    export_service.py      export packages and validation
  ports/
    repositories.py        asset, audit, run, export repositories
    object_store.py        source/normalized/derived content storage
    search.py              lexical/semantic search index port
    extractors.py          format extraction and normalization port
    connectors.py          source connector port
    policy.py              authorization and policy decision port
    events.py              event publisher and webhook port
    ai.py                  provider-neutral AI/model operation port
  adapters/
    memory/                deterministic in-memory test adapters
    sqlite/                local-first durable repository
    local_files/           local file/directory connector
    markitect_tool/        markdown syntax adapter
    builtin_extractors/    text/csv/simple document extraction
  api/
    app.py                 FastAPI app factory
    routes/                versioned HTTP routes
    schemas/               API request/response DTOs

Migration can be gradual. Existing modules can be retained temporarily as compatibility facades while new code moves into the layered structure.

Domain Core

The domain core should be deterministic, import-light, and usable without a running service.

Primary entities:

Entity                Responsibility
KnowledgeAsset        Stable asset identity and current operational state.
SourceReference       Origin information: source system, path, URL, external ID, checksum, connector reference.
AssetRepresentation   Source, normalized, or derived content form with media type, digest, size, producer, and storage reference.
AssetVersion          Traceable version of content, metadata, relationships, lifecycle, or derived output.
MetadataRecord        Standard and custom metadata with provenance and confidence where relevant.
Classification        Type, topic, sensitivity, lifecycle, operational category, review state.
ContextEntity         Person, team, project, case, customer, product, process, source system, topic, or business object.
Relationship          Typed link between assets or between an asset and contextual entity.
Actor                 Human, application, automation, service account, or AI agent identity.
OperationContext      Actor, delegated identity, correlation ID, request scope, policy scope, and operation metadata.
PolicyDecision        Allow, deny, redact, require review, dry-run only, or fail-closed result.
AuditEvent            Material operation record with actor, target, operation, outcome, correlation ID, policy context.
IngestionJob          Observable ingestion request and status.
TransformationRun     Traceable operation over assets producing derived artifacts.
WorkflowTemplate      Reusable workflow definition with steps, dependencies, inputs, outputs, and policy.
WorkflowRun           Executed workflow instance and step state.
ExportPackage         Governed export with manifest, integrity data, and selected records.

The old Artifact vocabulary can map to KnowledgeAsset and AssetRepresentation. The old Collection vocabulary remains useful as an organizational container, but assets should eventually support multiple collections/scopes where needed.
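As a rough illustration of a deterministic, import-light core, the sketch below models `AssetRepresentation` with the attributes listed for it in the entity table. The field names, the `RepresentationKind` enum, and the `describe_content` helper are assumptions; the choice of SHA-256 for the digest is also an assumption.

```python
import hashlib
from dataclasses import dataclass
from enum import Enum


class RepresentationKind(Enum):
    """The three content forms that must remain distinct."""
    SOURCE = "source"
    NORMALIZED = "normalized"
    DERIVED = "derived"


@dataclass(frozen=True)
class AssetRepresentation:
    asset_id: str
    kind: RepresentationKind
    media_type: str
    digest: str        # hex digest of the payload (SHA-256 assumed here)
    size: int          # payload size in bytes
    producer: str      # connector, extractor, or operation that produced it
    storage_ref: str   # opaque reference into the object/content store


def describe_content(
    asset_id: str,
    kind: RepresentationKind,
    media_type: str,
    payload: bytes,
    producer: str,
    storage_ref: str,
) -> AssetRepresentation:
    """Build a representation record; digest and size are derived from the payload."""
    return AssetRepresentation(
        asset_id=asset_id,
        kind=kind,
        media_type=media_type,
        digest=hashlib.sha256(payload).hexdigest(),
        size=len(payload),
        producer=producer,
        storage_ref=storage_ref,
    )
```

The record holds only a storage reference, not the payload itself, so the core stays usable without a running service or a specific content store.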

Application Services

Application services coordinate domain rules and ports. They should be thin but not anemic; this is where operation ordering, policy checks, audit emission, and repository updates meet.

Every material service method should follow this pattern:

validate input
resolve actor and operation context
load required state
authorize through policy port
perform deterministic domain change or submit job
persist changes
emit audit and events
return typed result or structured error
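The pattern above can be sketched for a single operation. Everything here is hypothetical: `retire_asset`, the `Result` type, the error codes, and the use of a plain dict, a callable, and a list as stand-ins for the repository, policy, and audit ports.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class OperationContext:
    actor_id: str
    correlation_id: str


@dataclass(frozen=True)
class Result:
    ok: bool
    error_code: Optional[str] = None


def retire_asset(
    asset_id: str,
    ctx: OperationContext,
    assets: dict,                                              # stand-in for a repository port
    is_allowed: Callable[[OperationContext, str, str], bool],  # stand-in for the policy port
    audit_log: list,                                           # stand-in for an audit repository
) -> Result:
    # Validate input. Early exits are not audited in this sketch.
    if not asset_id:
        return Result(ok=False, error_code="invalid_input")
    # The operation context is resolved by the caller and passed in explicitly.
    # Load required state.
    asset = assets.get(asset_id)
    if asset is None:
        return Result(ok=False, error_code="not_found")
    # Authorize through the policy port before any mutation.
    if is_allowed(ctx, "asset.retire", asset_id):
        # Perform the deterministic change and persist it.
        assets[asset_id] = {**asset, "lifecycle": "retired"}
        outcome, result = "succeeded", Result(ok=True)
    else:
        outcome, result = "denied", Result(ok=False, error_code="permission_denied")
    # Emit an audit record for both allowed and denied attempts.
    audit_log.append({
        "actor_id": ctx.actor_id,
        "correlation_id": ctx.correlation_id,
        "operation": "asset.retire",
        "target": asset_id,
        "outcome": outcome,
    })
    # Return a typed result or structured error.
    return result
```

Denied attempts are audited as well as successful ones, which keeps the audit trail useful for reviewing policy failures.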

Suggested service boundaries:

  • AssetService: create, retrieve, update, and retire assets; handle deletion requests; manage metadata, classification, versioning, and relationship changes.
  • IngestionService: submit ingestion, run extraction, validate normalized output, quarantine failures, reconcile re-ingestion.
  • RetrievalService: query, text search, filters, context graph retrieval, snippets, permission filtering, feedback.
  • TransformationService: operation registry, transformation runs, derived artifacts, lineage, review requirements.
  • WorkflowService: workflow templates, run execution, retries, cancel, resume, exception queues, human tasks.
  • AgentService: bounded agent operations, context packages, dry runs, review gates, agent audit.
  • ExportService: package selection, manifest, integrity validation, permission-aware export.

Ports And Adapters

Ports are stable interfaces owned by the engine. Adapters are replaceable implementations.

Required MVP ports:

  • Repository port for assets, representations, metadata, relationships, versions, runs, audit events, and exports.
  • Object/content store port for source, normalized, and derived content payloads.
  • Search index port for lexical search and later semantic/hybrid retrieval.
  • Extractor port for format-specific normalization.
  • Connector port for source systems.
  • Policy decision port for authorization and review requirements.
  • Event publisher port for observability, webhooks, and integration.
  • AI/model port for provider-neutral summarization, classification, extraction, embedding, or generation when enabled.

Adapter rules:

  • markitect-tool is an adapter for markdown syntax, selector extraction, deterministic markdown operations, snapshot identity, contracts/runtime checks, and context-package interoperability. Engine domain code must not import it directly; adapter code should persist serializable Markitect outputs as adapter provenance or representation metadata.
  • llm-connect or equivalent is an adapter for LLM providers.
  • phase-memory is an adjacent memory runtime; this engine may exchange opaque memory references or context packages but should not implement memory phases.
  • SQLite is an MVP repository adapter, not the domain model.
  • Semantic/vector search is an optional retrieval adapter, not the definition of retrieval.
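To make the port and adapter split concrete, here is a minimal sketch of an extractor port with two built-in adapters. The `Extractor` protocol shape, the `supports`/`extract` method names, and the `normalize` dispatcher are assumptions, not the engine's defined interface.

```python
import csv
import io
from typing import Protocol


class Extractor(Protocol):
    """Hypothetical extractor port: format-specific normalization to text."""

    def supports(self, media_type: str) -> bool: ...

    def extract(self, payload: bytes) -> str: ...


class PlainTextExtractor:
    def supports(self, media_type: str) -> bool:
        return media_type == "text/plain"

    def extract(self, payload: bytes) -> str:
        return payload.decode("utf-8")


class CsvExtractor:
    def supports(self, media_type: str) -> bool:
        return media_type == "text/csv"

    def extract(self, payload: bytes) -> str:
        # Render each CSV row as one line with " | " between cells.
        rows = csv.reader(io.StringIO(payload.decode("utf-8")))
        return "\n".join(" | ".join(row) for row in rows)


def normalize(payload: bytes, media_type: str, extractors: list) -> str:
    """Dispatch to the first adapter that supports the media type."""
    for extractor in extractors:
        if extractor.supports(media_type):
            return extractor.extract(payload)
    raise LookupError(f"no extractor registered for {media_type}")
```

Adding a new format then means registering another adapter; the dispatcher and the callers do not change.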

Persistence Blueprint

Use SQLite first for local-first durability, backed by tests that prove state survives repository re-instantiation.

Core tables should map to stable domain concepts:

  • assets,
  • source references,
  • representations,
  • metadata records,
  • classifications,
  • contextual entities,
  • relationships,
  • versions,
  • change records,
  • actors,
  • policy assignments or policy references,
  • audit events,
  • ingestion jobs,
  • transformation runs,
  • workflow templates,
  • workflow runs and step runs,
  • export packages and manifests.

Recommended storage style:

  • Relational columns for identifiers, types, status, timestamps, digests, foreign keys, and lifecycle fields.
  • JSON columns for flexible metadata, extractor details, policy context, and adapter-specific payloads.
  • Separate content/object references for large source, normalized, or derived payloads.
  • Append-only audit events and change records.
  • Deterministic ordering fields for pagination and tests.

Do not store permission-sensitive content in search indexes unless the retrieval layer can enforce permissions before exposing results.
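The recommended storage style can be sketched with the standard `sqlite3` module. The table layout, column names, and `SqliteAssetRepository` class are assumptions for illustration; only the general approach (relational columns, a JSON column, deterministic ordering, durability across instances) comes from the text above.

```python
import json
import os
import sqlite3
import tempfile
from typing import Optional


class SqliteAssetRepository:
    """Hypothetical repository adapter: relational columns plus one JSON column."""

    def __init__(self, path: str) -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            """CREATE TABLE IF NOT EXISTS assets (
                   asset_id      TEXT PRIMARY KEY,
                   lifecycle     TEXT NOT NULL,
                   created_at    TEXT NOT NULL,
                   metadata_json TEXT NOT NULL
               )"""
        )
        self._conn.commit()

    def save(self, asset_id: str, lifecycle: str, created_at: str, metadata: dict) -> None:
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute(
                "INSERT OR REPLACE INTO assets VALUES (?, ?, ?, ?)",
                (asset_id, lifecycle, created_at, json.dumps(metadata, sort_keys=True)),
            )

    def list_ids(self) -> list:
        # Deterministic ordering for pagination and tests.
        rows = self._conn.execute("SELECT asset_id FROM assets ORDER BY created_at, asset_id")
        return [row[0] for row in rows]

    def get_metadata(self, asset_id: str) -> Optional[dict]:
        row = self._conn.execute(
            "SELECT metadata_json FROM assets WHERE asset_id = ?", (asset_id,)
        ).fetchone()
        return None if row is None else json.loads(row[0])

    def close(self) -> None:
        self._conn.close()


# Usage: state written by one instance is visible to a fresh instance.
db_path = os.path.join(tempfile.mkdtemp(), "engine.db")
first = SqliteAssetRepository(db_path)
first.save("b", "active", "2026-05-05T00:00:00Z", {"topic": "planning"})
first.save("a", "active", "2026-05-05T00:00:00Z", {})
first.close()
second = SqliteAssetRepository(db_path)
```

The usage lines at the end mirror the kind of re-instantiation test the blueprint asks for: the second instance opens the same file and should see both assets.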

Retrieval Blueprint

MVP retrieval should be useful before semantic search:

  1. Retrieve by stable asset ID.
  2. Filter by metadata, classification, lifecycle, source, collection, and time.
  3. Search normalized text lexically.
  4. Retrieve by relationship and contextual entity.
  5. Return source-grounded snippets and explanation data.
  6. Enforce permissions before returning content, snippets, or relationship data.
  7. Capture feedback and quality signals.

Later retrieval can add:

  • semantic/vector search,
  • hybrid ranking,
  • facets and aggregations,
  • grounded answer packages,
  • federated external-source retrieval.

The retrieval contract should not expose backend-specific ranking internals as stable API.
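A minimal sketch of permission-aware lexical retrieval follows, assuming a hypothetical `lexical_search` function, a `Hit` result type, and a `can_read` callable standing in for the policy port. It is a naive substring search, not a ranking implementation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Hit:
    asset_id: str
    snippet: str


def lexical_search(
    query: str,
    documents: dict,                   # asset_id -> normalized text
    can_read: Callable[[str], bool],   # stand-in for a policy decision
    width: int = 20,
) -> list:
    """Case-insensitive substring search returning source-grounded snippets."""
    needle = query.lower()
    hits = []
    for asset_id in sorted(documents):  # deterministic result order
        # Enforce permissions before any content is read or returned.
        if not can_read(asset_id):
            continue
        text = documents[asset_id]
        pos = text.lower().find(needle)
        if pos == -1:
            continue
        start = max(0, pos - width)
        end = min(len(text), pos + len(needle) + width)
        hits.append(Hit(asset_id=asset_id, snippet=text[start:end]))
    return hits
```

The result type carries only an asset ID and a snippet, so no backend-specific score leaks into the contract.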

Workflow And Transformation Blueprint

Transformations and workflows should share a common run model.

Transformation:

source assets + versions + parameters + actor + policy context
  -> transformation run
  -> derived artifact representation
  -> lineage + audit + event
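The transformation flow above can be sketched as a function that records lineage alongside the derived content. The `DerivedArtifact` fields, the `run_transformation` signature, and the `head` operation are hypothetical; audit and event emission are left out to keep the sketch short.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class DerivedArtifact:
    content: str
    digest: str
    run_id: str
    source_versions: tuple   # ((asset_id, version), ...)
    parameters_json: str
    actor_id: str
    policy_ref: str


def run_transformation(
    run_id: str,
    operation: Callable[[list, dict], str],
    sources: dict,            # (asset_id, version) -> normalized text
    parameters: dict,
    actor_id: str,
    policy_ref: str,
) -> DerivedArtifact:
    keys = sorted(sources)    # deterministic input order
    content = operation([sources[key] for key in keys], parameters)
    return DerivedArtifact(
        content=content,
        digest=hashlib.sha256(content.encode("utf-8")).hexdigest(),
        run_id=run_id,
        source_versions=tuple(keys),
        parameters_json=json.dumps(parameters, sort_keys=True),
        actor_id=actor_id,
        policy_ref=policy_ref,
    )


def head(texts: list, params: dict) -> str:
    """Example operation: keep the first `chars` characters of each source."""
    return "\n".join(text[: params["chars"]] for text in texts)
```

Serializing the parameters with sorted keys makes the lineage record stable, so two runs with the same inputs can be compared directly.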

Workflow:

template + inputs + actor + trigger
  -> workflow run
  -> step runs
  -> assets / metadata / relationships / derived artifacts / review tasks
  -> audit + events + metrics

MVP execution can be embedded in the engine process, running synchronously or with lightweight in-process asynchrony. The contracts should still allow later replacement with a queue or external workflow engine.

Operation states should include queued, running, waiting, completed, failed, partially completed, retried, canceled, quarantined, and review required where applicable.
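A shared run model could represent these states as an enum with a guarded transition function. The state names come from the list above, but the transition table is purely an assumption for illustration; the blueprint does not define which transitions are legal.

```python
from enum import Enum


class RunState(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    WAITING = "waiting"
    COMPLETED = "completed"
    FAILED = "failed"
    PARTIALLY_COMPLETED = "partially_completed"
    RETRIED = "retried"
    CANCELED = "canceled"
    QUARANTINED = "quarantined"
    REVIEW_REQUIRED = "review_required"


# Assumed transition table; states missing from the keys are treated as terminal.
ALLOWED = {
    RunState.QUEUED: {RunState.RUNNING, RunState.CANCELED},
    RunState.RUNNING: {
        RunState.WAITING, RunState.COMPLETED, RunState.FAILED,
        RunState.PARTIALLY_COMPLETED, RunState.CANCELED,
        RunState.QUARANTINED, RunState.REVIEW_REQUIRED,
    },
    RunState.WAITING: {RunState.RUNNING, RunState.CANCELED},
    RunState.REVIEW_REQUIRED: {RunState.RUNNING, RunState.COMPLETED, RunState.CANCELED},
    RunState.FAILED: {RunState.RETRIED},
    RunState.PARTIALLY_COMPLETED: {RunState.RETRIED},
    RunState.RETRIED: {RunState.QUEUED},
}


def transition(current: RunState, target: RunState) -> RunState:
    """Return the target state, or raise if the move is not in the table."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Keeping the table as data rather than scattered conditionals would let transformation runs and workflow step runs share one definition.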

Policy, Governance, And Audit Blueprint

Policy is part of the core operating model, not a UI feature.

Policy inputs:

  • actor and delegated actor,
  • role and group membership,
  • operation type,
  • source-system permission context,
  • sensitivity,
  • lifecycle state,
  • review state,
  • asset policy,
  • workflow state,
  • requested output or export scope.

Policy outcomes:

  • allow,
  • deny,
  • redact,
  • require review,
  • dry-run only,
  • fail closed.
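The outcomes above map naturally to an enum, and fail-closed behavior can be sketched as a wrapper around rule evaluation. The `decide` function, the rule signature, and the example `sensitivity_rule` are hypothetical.

```python
from enum import Enum
from typing import Callable, Mapping


class PolicyOutcome(Enum):
    ALLOW = "allow"
    DENY = "deny"
    REDACT = "redact"
    REQUIRE_REVIEW = "require_review"
    DRY_RUN_ONLY = "dry_run_only"
    FAIL_CLOSED = "fail_closed"


def decide(rule: Callable[[Mapping], PolicyOutcome], inputs: Mapping) -> PolicyOutcome:
    """Evaluate a rule; any error or malformed result fails closed."""
    try:
        outcome = rule(inputs)
    except Exception:
        return PolicyOutcome.FAIL_CLOSED
    if not isinstance(outcome, PolicyOutcome):
        return PolicyOutcome.FAIL_CLOSED
    return outcome


def sensitivity_rule(inputs: Mapping) -> PolicyOutcome:
    """Example rule: exporting restricted assets needs review first."""
    if inputs["sensitivity"] == "restricted" and inputs["operation"] == "export":
        return PolicyOutcome.REQUIRE_REVIEW
    return PolicyOutcome.ALLOW
```

With this shape, a rule that crashes on missing inputs never results in an implicit allow.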

Audit should record material operations:

  • asset creation and updates,
  • ingestion,
  • metadata and classification changes,
  • relationship changes,
  • permission or policy changes,
  • retrieval/query where configured,
  • transformation runs,
  • workflow actions,
  • export,
  • agent operations,
  • administrative recovery actions.

Agent-Safe Operation Blueprint

Agents are actors with explicit scope. They must not receive implicit privileged access.

Agent operations should be listed in a bounded catalog:

  • inspect asset,
  • search assets,
  • retrieve permitted snippets,
  • assemble context package,
  • propose metadata enrichment,
  • propose classification,
  • request transformation,
  • invoke workflow,
  • submit review result,
  • dry-run change,
  • report generated output.

Each operation declares:

  • input schema,
  • output schema,
  • required permissions,
  • policy checks,
  • audit behavior,
  • review-gate behavior,
  • failure modes,
  • whether dry-run is supported.

Context packages should contain selected assets, snippets, metadata, relationships, provenance, task instructions, and policy constraints. They should be inspectable and bounded; they are not a back door to unrestricted repository access.
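A bounded catalog could look roughly like the sketch below, where an agent can only reach handlers that are explicitly registered. The `AgentOperation` fields, permission strings, error codes, and the two sample entries are assumptions; schemas, audit, and review gates are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AgentOperation:
    name: str
    required_permissions: frozenset
    supports_dry_run: bool
    handler: Callable[[dict], dict]


# Hypothetical catalog with two of the operations listed above.
CATALOG = {
    "inspect_asset": AgentOperation(
        name="inspect_asset",
        required_permissions=frozenset({"asset.read"}),
        supports_dry_run=False,
        handler=lambda payload: {"asset_id": payload["asset_id"]},
    ),
    "propose_classification": AgentOperation(
        name="propose_classification",
        required_permissions=frozenset({"asset.read", "classification.propose"}),
        supports_dry_run=True,
        handler=lambda payload: {"proposal": payload["label"]},
    ),
}


def invoke(name: str, payload: dict, granted: set, dry_run: bool = False) -> dict:
    """Run a catalog operation; anything outside the catalog is rejected."""
    operation = CATALOG.get(name)
    if operation is None:
        return {"ok": False, "error": "unknown_operation"}
    missing = operation.required_permissions - granted
    if missing:
        return {"ok": False, "error": "permission_denied", "missing": sorted(missing)}
    if dry_run:
        if not operation.supports_dry_run:
            return {"ok": False, "error": "dry_run_unsupported"}
        return {"ok": True, "dry_run": True}  # checks passed, handler not executed
    return {"ok": True, "result": operation.handler(payload)}
```

The catalog lookup is the only entry point, so an agent cannot call an internal method simply by knowing its name.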

Service API Blueprint

The FastAPI service should be an adapter over application services.

Endpoint groups:

  • /v1/assets
  • /v1/metadata
  • /v1/relationships
  • /v1/ingestion/jobs
  • /v1/retrieval/query
  • /v1/transformations
  • /v1/workflows
  • /v1/audit
  • /v1/policies
  • /v1/agents/operations
  • /v1/context-packages
  • /v1/exports
  • /v1/admin
  • /health, /ready, /version

API DTOs may differ from domain objects. Keep mapping explicit so the domain can evolve without leaking internal storage shape.
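Explicit mapping can be sketched with plain dataclasses, independent of any web framework. The `AssetResponse` DTO, its field names, and the `to_response` mapper are hypothetical; in the actual service the DTO would presumably live under `api/schemas/`.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class KnowledgeAsset:
    """Domain object (illustrative fields only)."""
    asset_id: str
    title: str
    lifecycle: str
    storage_ref: str   # internal detail that should not reach API clients


@dataclass(frozen=True)
class AssetResponse:
    """API DTO: a deliberately different, narrower shape."""
    id: str
    title: str
    status: str


def to_response(asset: KnowledgeAsset) -> AssetResponse:
    """Map field by field so renames in the domain do not silently change the API."""
    return AssetResponse(id=asset.asset_id, title=asset.title, status=asset.lifecycle)
```

Calling `asdict(to_response(asset))` yields a JSON-ready dict that contains no storage reference.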

Observability And Export Blueprint

Observability must cover both system operation and product quality.

Operational signals:

  • ingestion throughput,
  • source-update-to-index latency,
  • query latency,
  • API latency,
  • workflow completion rate,
  • job failure rate,
  • queue age,
  • storage/index health,
  • policy failures,
  • audit completeness.

Quality signals:

  • retrieval precision hooks,
  • zero-result rate,
  • low-confidence result rate,
  • citation precision,
  • unsupported-claim rate where AI adapters are used,
  • manual correction rate,
  • review turnaround time.

Export packages should include:

  • selected assets,
  • source and normalized representations where policy permits,
  • metadata,
  • relationships,
  • provenance,
  • versions,
  • audit references,
  • derived artifacts,
  • manifest,
  • schema version,
  • checksums,
  • actor and policy context.
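Manifest and checksum handling for an export package might be sketched as follows. The manifest layout, the `SCHEMA_VERSION` value, and the function names are assumptions; SHA-256 is used here as one reasonable choice of checksum.

```python
import hashlib

SCHEMA_VERSION = "0.1"  # hypothetical manifest schema version


def build_manifest(files: dict, actor_id: str, policy_ref: str) -> dict:
    """Describe each exported file (path -> bytes) with a checksum and size."""
    return {
        "schema_version": SCHEMA_VERSION,
        "actor_id": actor_id,
        "policy_ref": policy_ref,
        "entries": [
            {
                "path": path,
                "sha256": hashlib.sha256(files[path]).hexdigest(),
                "size": len(files[path]),
            }
            for path in sorted(files)   # deterministic entry order
        ],
    }


def verify(manifest: dict, files: dict) -> list:
    """Return a list of integrity problems; an empty list means the package is intact."""
    problems = []
    for entry in manifest["entries"]:
        data = files.get(entry["path"])
        if data is None:
            problems.append(f"missing: {entry['path']}")
        elif hashlib.sha256(data).hexdigest() != entry["sha256"]:
            problems.append(f"checksum mismatch: {entry['path']}")
    return problems
```

Returning a list of problems rather than a boolean gives export validation something concrete to report in a structured error.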

Implementation Sequence

  1. Finish KONT-WP-0004 by turning this blueprint into concrete ADRs and module migration decisions.
  2. Build KONT-WP-0005 first as the governed asset registry foundation.
  3. Add ingestion jobs and source/format adapters in KONT-WP-0006.
  4. Build permission-aware retrieval and context graph behavior in KONT-WP-0007.
  5. Add transformations, derived artifacts, and workflow jobs in KONT-WP-0008.
  6. Expose service and agent-safe APIs in KONT-WP-0009.
  7. Add observability, export, and enterprise-readiness surfaces in KONT-WP-0010.

Review Checklist

Use this checklist before accepting significant implementation changes:

  • Does the change preserve stable asset identity?
  • Does it distinguish source, normalized, and derived representations?
  • Does every material operation have actor context?
  • Are permission checks applied before content exposure or mutation?
  • Are audit events emitted or explicitly deferred?
  • Are errors structured and traceable with correlation IDs?
  • Does the code depend inward on domain contracts rather than outward on infrastructure?
  • Is the extension point a port owned by the engine?
  • Can the feature work with in-memory tests and a durable backend?
  • Can an agent use the feature only through explicit bounded operations?