# Architecture Blueprint

Date: 2026-05-05

Status: planning baseline for the V0.2 knowledge operations roadmap.

This blueprint defines the target architecture for `kontextual-engine` as a
headless knowledge operations engine. It should guide implementation workplans
without freezing every internal detail too early.

## Architectural Aim

`kontextual-engine` should make heterogeneous information assets durable,
contextual, governed, retrievable, transformable, and agent-operable through a
clean backend architecture.

The architecture should optimize for:

- stable knowledge asset identity,
- explicit source provenance,
- source, normalized, and derived representations,
- governed retrieval and transformation,
- traceable workflow and job execution,
- auditability and structured errors,
- provider-neutral adapters,
- agent-safe operations,
- exportability and operational visibility.

## Non-Negotiable Rules

1. Core behavior lives in domain and application services, not in HTTP routes.
2. All material operations accept explicit actor and operation context.
3. Permission and policy checks happen before content exposure or mutation.
4. Source content, normalized representations, and derived artifacts remain
   distinct.
5. Derived artifacts preserve lineage to sources, versions, parameters, actor,
   policy context, and operation run.
6. Audit records are emitted for material operations by default.
7. External systems are reached through ports and adapters.
8. Agent operations are explicit catalog entries, not unrestricted internal
   method access.
9. The service API wraps stable contracts; it does not define the domain model.
10. The MVP can use local-first backends, but contracts must not assume one
    storage, search, workflow, AI, or policy provider.

## System Shape

```text
clients, apps, workflows, agents
  -> service API / SDK / operation catalog
  -> application services
  -> domain core
  -> ports
  -> adapters and infrastructure
```

The dependency direction should point inward:

```text
api adapters -> application services -> domain core <- repository/search/workflow ports
                                                                      ^
                                                                      |
                                                           infrastructure adapters
```

No domain model should import FastAPI, SQLite, HTTP clients, LLM providers,
document parsing libraries, or source-system SDKs directly.

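
The inward dependency rule can be illustrated with a small sketch. The names `SearchHit`, `SearchPort`, `InMemorySearch`, and `find_assets` are hypothetical and chosen only for this example; the point is that the engine owns the port definition, and the adapter conforms to it without the application code ever importing the adapter's backend.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class SearchHit:
    """Domain-level result; carries no backend-specific fields."""
    asset_id: str
    snippet: str


class SearchPort(Protocol):
    """Port owned by the engine (hypothetical name and signature)."""

    def search(self, text: str) -> list[SearchHit]: ...


class InMemorySearch:
    """Infrastructure adapter; it matches the port's shape structurally."""

    def __init__(self, documents: dict[str, str]) -> None:
        self._documents = documents

    def search(self, text: str) -> list[SearchHit]:
        needle = text.lower()
        return [
            SearchHit(asset_id=asset_id, snippet=body[:40])
            for asset_id, body in sorted(self._documents.items())
            if needle in body.lower()
        ]


def find_assets(port: SearchPort, text: str) -> list[str]:
    """Application-level code sees only the port, never the backend."""
    return [hit.asset_id for hit in port.search(text)]


hits = find_assets(
    InMemorySearch({"a1": "Quarterly report", "a2": "Meeting notes"}),
    "report",
)
```

Swapping `InMemorySearch` for a SQLite-backed or external index would, under this sketch, require no change to `find_assets`.
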
## Package Layout

The current flat package can evolve incrementally into this shape:

```text
src/kontextual_engine/
  core/
    assets.py               stable asset identity and representations
    metadata.py             metadata, classification, lifecycle, schemas
    relationships.py        typed relationships and contextual entities
    provenance.py           source refs, lineage, versions, changes
    actors.py               actors, delegated actors, operation context
    policy.py               policy inputs, decisions, review requirements
    audit.py                audit events and correlation IDs
    errors.py               structured errors and diagnostics
  services/
    asset_service.py        create/update/retire assets
    ingestion_service.py    submit and complete ingestion jobs
    retrieval_service.py    search, filter, snippets, context retrieval
    transform_service.py    transformation runs and derived artifacts
    workflow_service.py     templates, runs, steps, retries, exceptions
    agent_service.py        bounded agent operation catalog
    export_service.py       export packages and validation
  ports/
    repositories.py         asset, audit, run, export repositories
    object_store.py         source/normalized/derived content storage
    search.py               lexical/semantic search index port
    extractors.py           format extraction and normalization port
    connectors.py           source connector port
    policy.py               authorization and policy decision port
    events.py               event publisher and webhook port
    ai.py                   provider-neutral AI/model operation port
  adapters/
    memory/                 deterministic in-memory test adapters
    sqlite/                 local-first durable repository
    local_files/            local file/directory connector
    markitect_tool/         markdown syntax adapter
    builtin_extractors/     text/csv/simple document extraction
  api/
    app.py                  FastAPI app factory
    routes/                 versioned HTTP routes
    schemas/                API request/response DTOs
```

Migration can be gradual. Existing modules can be retained temporarily as
compatibility facades while new code moves into the layered structure.

## Domain Core

The domain core should be deterministic, import-light, and usable without a
running service.

Primary entities:

| Entity | Responsibility |
| --- | --- |
| `KnowledgeAsset` | Stable asset identity and current operational state. |
| `SourceReference` | Origin information: source system, path, URL, external ID, checksum, connector reference. |
| `AssetRepresentation` | Source, normalized, or derived content form with media type, digest, size, producer, and storage reference. |
| `AssetVersion` | Traceable version of content, metadata, relationships, lifecycle, or derived output. |
| `MetadataRecord` | Standard and custom metadata with provenance and confidence where relevant. |
| `Classification` | Type, topic, sensitivity, lifecycle, operational category, review state. |
| `ContextEntity` | Person, team, project, case, customer, product, process, source system, topic, or business object. |
| `Relationship` | Typed link between assets or between an asset and a contextual entity. |
| `Actor` | Human, application, automation, service account, or AI agent identity. |
| `OperationContext` | Actor, delegated identity, correlation ID, request scope, policy scope, and operation metadata. |
| `PolicyDecision` | Allow, deny, redact, require review, dry-run only, or fail-closed result. |
| `AuditEvent` | Material operation record with actor, target, operation, outcome, correlation ID, policy context. |
| `IngestionJob` | Observable ingestion request and status. |
| `TransformationRun` | Traceable operation over assets producing derived artifacts. |
| `WorkflowTemplate` | Reusable workflow definition with steps, dependencies, inputs, outputs, and policy. |
| `WorkflowRun` | Executed workflow instance and step state. |
| `ExportPackage` | Governed export with manifest, integrity data, and selected records. |

The old `Artifact` vocabulary can map to `KnowledgeAsset` and
`AssetRepresentation`. The old `Collection` vocabulary remains useful as an
organizational container, but assets should eventually support multiple
collections/scopes where needed.

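
As a rough illustration of how two of these entities might look as import-light code, the sketch below models `KnowledgeAsset` and `AssetRepresentation` with standard-library dataclasses. The field names are assumptions derived from the responsibility column, and `RepresentationKind`, the digests, and the storage references are placeholders rather than settled design.

```python
from dataclasses import dataclass, field
from enum import Enum
from uuid import uuid4


class RepresentationKind(Enum):
    SOURCE = "source"
    NORMALIZED = "normalized"
    DERIVED = "derived"


@dataclass(frozen=True)
class AssetRepresentation:
    """One content form of an asset (field names are assumptions)."""
    kind: RepresentationKind
    media_type: str
    digest: str
    size: int
    producer: str
    storage_ref: str


@dataclass
class KnowledgeAsset:
    """Stable identity plus the representations attached to it."""
    asset_id: str = field(default_factory=lambda: uuid4().hex)
    lifecycle_state: str = "active"
    representations: list[AssetRepresentation] = field(default_factory=list)

    def add_representation(self, representation: AssetRepresentation) -> None:
        # Identity stays fixed; only the set of content forms grows.
        self.representations.append(representation)

    def by_kind(self, kind: RepresentationKind) -> list[AssetRepresentation]:
        return [r for r in self.representations if r.kind is kind]


asset = KnowledgeAsset()
original_id = asset.asset_id
asset.add_representation(AssetRepresentation(
    RepresentationKind.SOURCE, "text/markdown", "sha256:abc", 120,
    "local_files", "objects/abc"))
asset.add_representation(AssetRepresentation(
    RepresentationKind.NORMALIZED, "text/plain", "sha256:def", 98,
    "builtin_extractors", "objects/def"))
```

Keeping the source and normalized forms as separate records, rather than overwriting one with the other, is what lets the asset ID stay stable across re-ingestion.
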
## Application Services

Application services coordinate domain rules and ports. They should be thin but
not anemic; this is where operation ordering, policy checks, audit emission, and
repository updates meet.

Every material service method should follow this pattern:

```text
validate input
resolve actor and operation context
load required state
authorize through policy port
perform deterministic domain change or submit job
persist changes
emit audit and events
return typed result or structured error
```

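
The pattern above can be sketched as a single service method. Everything here is hypothetical: `AssetService.retire_asset`, the `Result` type, and the error codes are illustrative, and a set, a dict, and a list stand in for the policy, repository, and audit ports.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class OperationContext:
    actor_id: str
    correlation_id: str


@dataclass(frozen=True)
class Result:
    ok: bool
    value: Optional[str] = None
    error_code: Optional[str] = None


@dataclass
class AssetService:
    allowed_actors: set                             # stand-in for a policy port
    assets: dict = field(default_factory=dict)      # stand-in for a repository port
    audit_log: list = field(default_factory=list)   # stand-in for an audit sink

    def retire_asset(self, ctx: OperationContext, asset_id: str) -> Result:
        # validate input
        if not asset_id:
            return self._finish(ctx, asset_id, Result(False, error_code="invalid_input"))
        # actor and operation context arrive explicitly as `ctx`;
        # load required state
        if asset_id not in self.assets:
            return self._finish(ctx, asset_id, Result(False, error_code="not_found"))
        # authorize before any mutation
        if ctx.actor_id not in self.allowed_actors:
            return self._finish(ctx, asset_id, Result(False, error_code="denied"))
        # deterministic domain change, persisted in the stand-in repository
        self.assets[asset_id] = "retired"
        return self._finish(ctx, asset_id, Result(True, value="retired"))

    def _finish(self, ctx: OperationContext, asset_id: str, result: Result) -> Result:
        # emit audit for every outcome, then return the typed result
        self.audit_log.append({
            "actor": ctx.actor_id,
            "correlation_id": ctx.correlation_id,
            "operation": "asset.retire",
            "target": asset_id,
            "outcome": "ok" if result.ok else result.error_code,
        })
        return result


service = AssetService(allowed_actors={"alice"}, assets={"a1": "active"})
denied = service.retire_asset(OperationContext("mallory", "c-1"), "a1")
retired = service.retire_asset(OperationContext("alice", "c-2"), "a1")
```

Routing every exit through `_finish` is one way to make audit emission the default rather than something each branch has to remember.
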
Suggested service boundaries:

- `AssetService`: create, retrieve, update, retire, deletion requests, metadata,
  classification, versioning, relationship changes.
- `IngestionService`: submit ingestion, run extraction, validate normalized
  output, quarantine failures, reconcile re-ingestion.
- `RetrievalService`: query, text search, filters, context graph retrieval,
  snippets, permission filtering, feedback.
- `TransformationService`: operation registry, transformation runs, derived
  artifacts, lineage, review requirements.
- `WorkflowService`: workflow templates, run execution, retries, cancel,
  resume, exception queues, human tasks.
- `AgentService`: bounded agent operations, context packages, dry runs, review
  gates, agent audit.
- `ExportService`: package selection, manifest, integrity validation,
  permission-aware export.

## Ports And Adapters

Ports are stable interfaces owned by the engine. Adapters are replaceable
implementations.

Required MVP ports:

- Repository port for assets, representations, metadata, relationships,
  versions, runs, audit events, and exports.
- Object/content store port for source, normalized, and derived content payloads.
- Search index port for lexical search and later semantic/hybrid retrieval.
- Extractor port for format-specific normalization.
- Connector port for source systems.
- Policy decision port for authorization and review requirements.
- Event publisher port for observability, webhooks, and integration.
- AI/model port for provider-neutral summarization, classification, extraction,
  embedding, or generation when enabled.

Adapter rules:

- `markitect-tool` is an adapter for markdown syntax, selector extraction,
  deterministic markdown operations, snapshot identity, contracts/runtime
  checks, and context-package interoperability. Engine domain code must not
  import it directly; adapter code should persist serializable Markitect
  outputs as adapter provenance or representation metadata.
- `llm-connect` or equivalent is an adapter for LLM providers.
- `phase-memory` is an adjacent memory runtime; this engine may exchange opaque
  memory references or context packages but should not implement memory phases.
- SQLite is an MVP repository adapter, not the domain model.
- Semantic/vector search is an optional retrieval adapter, not the definition of
  retrieval.

## Persistence Blueprint

Use SQLite first for local-first durability and tests that prove state survives
repository re-instantiation.

Core tables should map to stable domain concepts:

- assets,
- source references,
- representations,
- metadata records,
- classifications,
- contextual entities,
- relationships,
- versions,
- change records,
- actors,
- policy assignments or policy references,
- audit events,
- ingestion jobs,
- transformation runs,
- workflow templates,
- workflow runs and step runs,
- export packages and manifests.

Recommended storage style:

- Relational columns for identifiers, types, status, timestamps, digests,
  foreign keys, and lifecycle fields.
- JSON columns for flexible metadata, extractor details, policy context, and
  adapter-specific payloads.
- Separate content/object references for large source, normalized, or derived
  payloads.
- Append-only audit events and change records.
- Deterministic ordering fields for pagination and tests.

Do not store permission-sensitive content in search indexes unless the retrieval
layer can enforce permissions before exposing results.

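
A minimal sketch of this storage style, using only the standard-library `sqlite3` module, is shown below. The table and column names and the `SqliteAssetRepository` class are assumptions for illustration, not the project's schema. The closing lines mimic the kind of test described above: write, close, reopen, and read back.

```python
import json
import os
import sqlite3
import tempfile

# Hypothetical two-table schema: relational columns for identifiers and
# status, a JSON text column for flexible metadata, and an autoincrement
# sequence that gives audit events a deterministic order.
SCHEMA = """
CREATE TABLE IF NOT EXISTS assets (
    asset_id   TEXT PRIMARY KEY,
    status     TEXT NOT NULL,
    created_at TEXT NOT NULL,
    metadata   TEXT NOT NULL DEFAULT '{}'
);
CREATE TABLE IF NOT EXISTS audit_events (
    seq            INTEGER PRIMARY KEY AUTOINCREMENT,
    asset_id       TEXT NOT NULL REFERENCES assets(asset_id),
    operation      TEXT NOT NULL,
    correlation_id TEXT NOT NULL
);
"""


class SqliteAssetRepository:
    """Exposes inserts only for audit events, keeping them append-only."""

    def __init__(self, path: str) -> None:
        self._conn = sqlite3.connect(path)
        self._conn.executescript(SCHEMA)

    def add_asset(self, asset_id: str, metadata: dict, correlation_id: str) -> None:
        with self._conn:  # one transaction for the row and its audit event
            self._conn.execute(
                "INSERT INTO assets (asset_id, status, created_at, metadata) "
                "VALUES (?, 'active', datetime('now'), ?)",
                (asset_id, json.dumps(metadata)),
            )
            self._conn.execute(
                "INSERT INTO audit_events (asset_id, operation, correlation_id) "
                "VALUES (?, 'asset.create', ?)",
                (asset_id, correlation_id),
            )

    def get_metadata(self, asset_id: str):
        row = self._conn.execute(
            "SELECT metadata FROM assets WHERE asset_id = ?", (asset_id,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def audit_count(self) -> int:
        return self._conn.execute("SELECT COUNT(*) FROM audit_events").fetchone()[0]

    def close(self) -> None:
        self._conn.close()


db_path = os.path.join(tempfile.mkdtemp(), "engine.db")
repo = SqliteAssetRepository(db_path)
repo.add_asset("a1", {"topic": "billing"}, "c-1")
repo.close()

reopened = SqliteAssetRepository(db_path)  # state must survive re-instantiation
stored = reopened.get_metadata("a1")
events = reopened.audit_count()
reopened.close()
```
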
## Retrieval Blueprint

MVP retrieval should be useful before semantic search:

1. Retrieve by stable asset ID.
2. Filter by metadata, classification, lifecycle, source, collection, and time.
3. Search normalized text lexically.
4. Retrieve by relationship and contextual entity.
5. Return source-grounded snippets and explanation data.
6. Enforce permissions before returning content, snippets, or relationship data.
7. Capture feedback and quality signals.

Later retrieval can add:

- semantic/vector search,
- hybrid ranking,
- facets and aggregations,
- grounded answer packages,
- federated external-source retrieval.

The retrieval contract should not expose backend-specific ranking internals as
stable API.

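
To make the ordering of permission checks and snippet building concrete, here is a small sketch of lexical retrieval over normalized text. `IndexedAsset`, `lexical_search`, and the sensitivity labels are hypothetical; a real policy port would replace the simple set-membership check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedAsset:
    asset_id: str
    text: str           # normalized text
    sensitivity: str    # hypothetical label used for permission filtering


def lexical_search(assets, query: str, allowed_sensitivities: set) -> list:
    needle = query.lower()
    results = []
    # Sorting by asset ID keeps result order deterministic for tests.
    for asset in sorted(assets, key=lambda a: a.asset_id):
        # The permission check runs first, so text the caller may not see
        # never reaches the snippet-building step.
        if asset.sensitivity not in allowed_sensitivities:
            continue
        position = asset.text.lower().find(needle)
        if position < 0:
            continue
        start = max(0, position - 10)
        end = position + len(needle) + 10
        results.append({
            "asset_id": asset.asset_id,
            "snippet": asset.text[start:end],
            "offset": position,  # explanation data: where the match was found
        })
    return results


corpus = [
    IndexedAsset("a1", "Invoice policy for vendors", "internal"),
    IndexedAsset("a2", "Invoice fraud investigation", "restricted"),
]
results = lexical_search(corpus, "invoice", {"internal"})
```

Note that the result dictionaries carry an offset but no backend score, in keeping with the rule about not exposing ranking internals.
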
## Workflow And Transformation Blueprint

Transformations and workflows should share a common run model.

Transformation:

```text
source assets + versions + parameters + actor + policy context
  -> transformation run
  -> derived artifact representation
  -> lineage + audit + event
```

Workflow:

```text
template + inputs + actor + trigger
  -> workflow run
  -> step runs
  -> assets / metadata / relationships / derived artifacts / review tasks
  -> audit + events + metrics
```

MVP execution can be embedded and either synchronous or lightly asynchronous.
The contracts should still allow later replacement with a queue or external
workflow engine.

Operation states should include queued, running, waiting, completed, failed,
partially completed, retried, canceled, quarantined, and review required where
applicable.

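
One way to picture a shared run model is a single `Run` type carrying the states listed above. The sketch below is an assumption-heavy illustration: the blueprint names the states but not the transitions between them, so the `ALLOWED` table is hypothetical, as are the `Run` fields.

```python
from dataclasses import dataclass, field
from enum import Enum


class RunState(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    WAITING = "waiting"
    COMPLETED = "completed"
    FAILED = "failed"
    PARTIALLY_COMPLETED = "partially_completed"
    RETRIED = "retried"
    CANCELED = "canceled"
    QUARANTINED = "quarantined"
    REVIEW_REQUIRED = "review_required"


# Hypothetical transition table; states without an entry are terminal here.
ALLOWED = {
    RunState.QUEUED: {RunState.RUNNING, RunState.CANCELED},
    RunState.RUNNING: {
        RunState.WAITING, RunState.COMPLETED, RunState.FAILED,
        RunState.PARTIALLY_COMPLETED, RunState.CANCELED,
        RunState.QUARANTINED, RunState.REVIEW_REQUIRED,
    },
    RunState.WAITING: {RunState.RUNNING, RunState.CANCELED},
    RunState.FAILED: {RunState.RETRIED, RunState.QUARANTINED},
    RunState.RETRIED: {RunState.RUNNING},
    RunState.REVIEW_REQUIRED: {RunState.RUNNING, RunState.COMPLETED, RunState.CANCELED},
}


@dataclass
class Run:
    run_id: str
    kind: str  # "transformation" or "workflow"; both share this model
    state: RunState = RunState.QUEUED
    history: list = field(default_factory=list)

    def transition(self, new_state: RunState) -> None:
        if new_state not in ALLOWED.get(self.state, set()):
            raise ValueError(f"{self.state.value} -> {new_state.value} is not allowed")
        self.history.append((self.state, new_state))  # traceable state changes
        self.state = new_state


run = Run("r1", "transformation")
run.transition(RunState.RUNNING)
run.transition(RunState.COMPLETED)
```

Because the state machine lives in plain data, a later queue or external workflow engine could drive the same transitions without changing the contract.
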
## Policy, Governance, And Audit Blueprint

Policy is part of the core operating model, not a UI feature.

Policy inputs:

- actor and delegated actor,
- role and group membership,
- operation type,
- source-system permission context,
- sensitivity,
- lifecycle state,
- review state,
- asset policy,
- workflow state,
- requested output or export scope.

Policy outcomes:

- allow,
- deny,
- redact,
- require review,
- dry-run only,
- fail closed.

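
The outcomes above could be modeled as an enumeration, with "fail closed" read in the usual sense: if the policy evaluation itself breaks, the operation is refused rather than allowed. The sketch below assumes that reading; `decide` and `sample_evaluator` are hypothetical names, and the rules inside the evaluator are invented purely to exercise the wrapper.

```python
from enum import Enum
from typing import Callable


class PolicyOutcome(Enum):
    ALLOW = "allow"
    DENY = "deny"
    REDACT = "redact"
    REQUIRE_REVIEW = "require_review"
    DRY_RUN_ONLY = "dry_run_only"
    FAIL_CLOSED = "fail_closed"


def decide(evaluator: Callable[[dict], PolicyOutcome], policy_input: dict) -> PolicyOutcome:
    """Wrap an evaluator so that any failure results in FAIL_CLOSED."""
    try:
        outcome = evaluator(policy_input)
    except Exception:
        # Missing inputs or evaluator bugs must never widen access.
        return PolicyOutcome.FAIL_CLOSED
    return outcome if isinstance(outcome, PolicyOutcome) else PolicyOutcome.FAIL_CLOSED


def sample_evaluator(policy_input: dict) -> PolicyOutcome:
    # Illustrative rules only; real policy would come from a policy port.
    if policy_input["sensitivity"] == "restricted" and policy_input["operation"] == "export":
        return PolicyOutcome.DENY
    if policy_input["review_state"] == "pending":
        return PolicyOutcome.REQUIRE_REVIEW
    return PolicyOutcome.ALLOW


export_decision = decide(sample_evaluator, {
    "sensitivity": "restricted", "operation": "export", "review_state": "approved",
})
incomplete_decision = decide(sample_evaluator, {"operation": "export"})
```
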
Audit should record material operations:

- asset creation and updates,
- ingestion,
- metadata and classification changes,
- relationship changes,
- permission or policy changes,
- retrieval/query where configured,
- transformation runs,
- workflow actions,
- export,
- agent operations,
- administrative recovery actions.

## Agent-Safe Operation Blueprint

Agents are actors with explicit scope. They must not receive implicit privileged
access.

Agent operations should be listed in a bounded catalog:

- inspect asset,
- search assets,
- retrieve permitted snippets,
- assemble context package,
- propose metadata enrichment,
- propose classification,
- request transformation,
- invoke workflow,
- submit review result,
- dry-run change,
- report generated output.

Each operation declares:

- input schema,
- output schema,
- required permissions,
- policy checks,
- audit behavior,
- review-gate behavior,
- failure modes,
- whether dry-run is supported.

Context packages should contain selected assets, snippets, metadata,
relationships, provenance, task instructions, and policy constraints. They
should be inspectable and bounded; they are not a back door to unrestricted
repository access.

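
A bounded catalog with per-operation declarations might look roughly like the sketch below. The `AgentOperation` fields cover only a subset of the declarations listed above, and the permission strings, schema shapes, and `invoke` return values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentOperation:
    """Catalog entry; a subset of the declarations an operation would carry."""
    name: str
    input_schema: dict
    output_schema: dict
    required_permissions: frozenset
    supports_dry_run: bool
    requires_review: bool


CATALOG = {
    op.name: op
    for op in (
        AgentOperation(
            "search_assets", {"query": "string"}, {"asset_ids": "array"},
            frozenset({"asset.read"}), False, False,
        ),
        AgentOperation(
            "propose_classification",
            {"asset_id": "string", "label": "string"}, {"proposal_id": "string"},
            frozenset({"asset.read", "classification.propose"}), True, True,
        ),
    )
}


def invoke(operation_name: str, granted_permissions: set, dry_run: bool = False) -> str:
    operation = CATALOG.get(operation_name)
    if operation is None:
        # Anything outside the catalog is unreachable for agents.
        return "rejected: not in catalog"
    if not operation.required_permissions <= set(granted_permissions):
        return "rejected: missing permission"
    if dry_run:
        return "dry-run accepted" if operation.supports_dry_run else "rejected: dry-run unsupported"
    return "pending review" if operation.requires_review else "accepted"


outcomes = [
    invoke("delete_everything", {"asset.read"}),
    invoke("search_assets", {"asset.read"}),
    invoke("propose_classification", {"asset.read"}),
    invoke("propose_classification", {"asset.read", "classification.propose"}),
]
```

The lookup-then-check order means an agent cannot reach an internal method simply by knowing its name; only declared entries resolve.
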
## Service API Blueprint

The FastAPI service should be an adapter over application services.

Endpoint groups:

- `/v1/assets`
- `/v1/metadata`
- `/v1/relationships`
- `/v1/ingestion/jobs`
- `/v1/retrieval/query`
- `/v1/transformations`
- `/v1/workflows`
- `/v1/audit`
- `/v1/policies`
- `/v1/agents/operations`
- `/v1/context-packages`
- `/v1/exports`
- `/v1/admin`
- `/health`, `/ready`, `/version`

API DTOs may differ from domain objects. Keep mapping explicit so the domain can
evolve without leaking internal storage shape.

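
The explicit mapping between domain objects and DTOs can be shown without any web framework. In this sketch, `AssetResponse`, `to_response`, and the field names are hypothetical; the domain class is a simplified stand-in for `KnowledgeAsset`.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class KnowledgeAsset:
    """Simplified domain object (hypothetical fields)."""
    asset_id: str
    lifecycle_state: str
    storage_ref: str  # internal storage detail that must not leak


@dataclass(frozen=True)
class AssetResponse:
    """API DTO; its shape is free to differ from the domain object."""
    id: str
    status: str


def to_response(asset: KnowledgeAsset) -> AssetResponse:
    # Field-by-field mapping: nothing is exposed unless it is named here.
    return AssetResponse(id=asset.asset_id, status=asset.lifecycle_state)


payload = asdict(to_response(KnowledgeAsset("a1", "active", "objects/abc")))
```

Because the mapping names each field, adding a new internal attribute to the domain object does not change the response until someone deliberately maps it.
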
## Observability And Export Blueprint

Observability must cover both system operation and product quality.

Operational signals:

- ingestion throughput,
- source-update-to-index latency,
- query latency,
- API latency,
- workflow completion rate,
- job failure rate,
- queue age,
- storage/index health,
- policy failures,
- audit completeness.

Quality signals:

- retrieval precision hooks,
- zero-result rate,
- low-confidence result rate,
- citation precision,
- unsupported-claim rate where AI adapters are used,
- manual correction rate,
- review turnaround time.

Export packages should include:

- selected assets,
- source and normalized representations where policy permits,
- metadata,
- relationships,
- provenance,
- versions,
- audit references,
- derived artifacts,
- manifest,
- schema version,
- checksums,
- actor and policy context.

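
As an illustration of the manifest, schema version, and checksum items, the sketch below builds a manifest over a set of exported payloads and verifies it afterwards. The manifest layout, the file paths, and the function names `build_manifest` and `verify` are assumptions; SHA-256 is used here only as a reasonable default.

```python
import hashlib


def build_manifest(records: dict, actor_id: str, schema_version: str = "1") -> dict:
    """Describe each exported payload by path, size, and SHA-256 digest."""
    entries = [
        {
            "path": path,
            "size": len(payload),
            "sha256": hashlib.sha256(payload).hexdigest(),
        }
        for path, payload in sorted(records.items())  # deterministic order
    ]
    return {"schema_version": schema_version, "actor": actor_id, "entries": entries}


def verify(manifest: dict, records: dict) -> bool:
    """Return True only if every manifest entry matches the payload bytes."""
    return all(
        entry["path"] in records
        and hashlib.sha256(records[entry["path"]]).hexdigest() == entry["sha256"]
        for entry in manifest["entries"]
    )


records = {
    "assets/a1.json": b'{"id": "a1"}',
    "representations/a1.txt": b"hello",
}
manifest = build_manifest(records, "alice")
tampered = {**records, "representations/a1.txt": b"HELLO"}
```

Here `verify(manifest, records)` succeeds, while the same check against `tampered` fails because one digest no longer matches.
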
## Implementation Sequence

1. Finish `KONT-WP-0004` by turning this blueprint into concrete ADRs and
   module migration decisions.
2. Build `KONT-WP-0005` first as the governed asset registry foundation.
3. Add ingestion jobs and source/format adapters in `KONT-WP-0006`.
4. Build permission-aware retrieval and context graph behavior in
   `KONT-WP-0007`.
5. Add transformations, derived artifacts, and workflow jobs in
   `KONT-WP-0008`.
6. Expose service and agent-safe APIs in `KONT-WP-0009`.
7. Add observability, export, and enterprise-readiness surfaces in
   `KONT-WP-0010`.

## Review Checklist

Use this checklist before accepting significant implementation changes:

- Does the change preserve stable asset identity?
- Does it distinguish source, normalized, and derived representations?
- Does every material operation have actor context?
- Are permission checks applied before content exposure or mutation?
- Are audit events emitted or explicitly deferred?
- Are errors structured and traceable with correlation IDs?
- Does the code depend inward on domain contracts rather than outward on
  infrastructure?
- Is the extension point a port owned by the engine?
- Can the feature work with in-memory tests and a durable backend?
- Can an agent use the feature only through explicit bounded operations?