diff --git a/README.md b/README.md index 556a689..f988048 100644 --- a/README.md +++ b/README.md @@ -12,6 +12,7 @@ Start here: - `wiki/kontextual-engine_scope_research_md_bundle/` - `SCOPE.md` - `docs/knowledge-operations-roadmap.md` +- `docs/architecture-blueprint.md` - `docs/stack-decision.md` - `docs/markitect-main-scope-assessment.md` - `docs/markitect-tool-reuse-boundary.md` diff --git a/docs/architecture-blueprint.md b/docs/architecture-blueprint.md new file mode 100644 index 0000000..92cc77f --- /dev/null +++ b/docs/architecture-blueprint.md @@ -0,0 +1,474 @@ +# Architecture Blueprint + +Date: 2026-05-05 + +Status: planning baseline for the V0.2 knowledge operations roadmap. + +This blueprint defines the target architecture for `kontextual-engine` as a +headless knowledge operations engine. It should guide implementation workplans +without freezing every internal detail too early. + +## Architectural Aim + +`kontextual-engine` should make heterogeneous information assets durable, +contextual, governed, retrievable, transformable, and agent-operable through a +clean backend architecture. + +The architecture should optimize for: + +- stable knowledge asset identity, +- explicit source provenance, +- source, normalized, and derived representations, +- governed retrieval and transformation, +- traceable workflow and job execution, +- auditability and structured errors, +- provider-neutral adapters, +- agent-safe operations, +- exportability and operational visibility. + +## Non-Negotiable Rules + +1. Core behavior lives in domain and application services, not in HTTP routes. +2. All material operations accept explicit actor and operation context. +3. Permission and policy checks happen before content exposure or mutation. +4. Source content, normalized representations, and derived artifacts remain + distinct. +5. Derived artifacts preserve lineage to sources, versions, parameters, actor, + policy context, and operation run. +6. 
Audit records are emitted for material operations by default. +7. External systems are reached through ports and adapters. +8. Agent operations are explicit catalog entries, not unrestricted internal + method access. +9. The service API wraps stable contracts; it does not define the domain model. +10. The MVP can use local-first backends, but contracts must not assume one + storage, search, workflow, AI, or policy provider. + +## System Shape + +```text +clients, apps, workflows, agents + -> service API / SDK / operation catalog + -> application services + -> domain core + -> ports + -> adapters and infrastructure +``` + +The dependency direction should point inward: + +```text +api adapters -> application services -> domain core <- repository/search/workflow ports + ^ + | + infrastructure adapters +``` + +No domain model should import FastAPI, SQLite, HTTP clients, LLM providers, +document parsing libraries, or source-system SDKs directly. + +## Package Layout + +The current flat package can evolve incrementally into this shape: + +```text +src/kontextual_engine/ + core/ + assets.py stable asset identity and representations + metadata.py metadata, classification, lifecycle, schemas + relationships.py typed relationships and contextual entities + provenance.py source refs, lineage, versions, changes + actors.py actors, delegated actors, operation context + policy.py policy inputs, decisions, review requirements + audit.py audit events and correlation IDs + errors.py structured errors and diagnostics + services/ + asset_service.py create/update/retire assets + ingestion_service.py submit and complete ingestion jobs + retrieval_service.py search, filter, snippets, context retrieval + transform_service.py transformation runs and derived artifacts + workflow_service.py templates, runs, steps, retries, exceptions + agent_service.py bounded agent operation catalog + export_service.py export packages and validation + ports/ + repositories.py asset, audit, run, export 
repositories + object_store.py source/normalized/derived content storage + search.py lexical/semantic search index port + extractors.py format extraction and normalization port + connectors.py source connector port + policy.py authorization and policy decision port + events.py event publisher and webhook port + ai.py provider-neutral AI/model operation port + adapters/ + memory/ deterministic in-memory test adapters + sqlite/ local-first durable repository + local_files/ local file/directory connector + markitect_tool/ markdown syntax adapter + builtin_extractors/ text/csv/simple document extraction + api/ + app.py FastAPI app factory + routes/ versioned HTTP routes + schemas/ API request/response DTOs +``` + +Migration can be gradual. Existing modules can be retained temporarily as +compatibility facades while new code moves into the layered structure. + +## Domain Core + +The domain core should be deterministic, import-light, and usable without a +running service. + +Primary entities: + +| Entity | Responsibility | +| --- | --- | +| `KnowledgeAsset` | Stable asset identity and current operational state. | +| `SourceReference` | Origin information: source system, path, URL, external ID, checksum, connector reference. | +| `AssetRepresentation` | Source, normalized, or derived content form with media type, digest, size, producer, and storage reference. | +| `AssetVersion` | Traceable version of content, metadata, relationships, lifecycle, or derived output. | +| `MetadataRecord` | Standard and custom metadata with provenance and confidence where relevant. | +| `Classification` | Type, topic, sensitivity, lifecycle, operational category, review state. | +| `ContextEntity` | Person, team, project, case, customer, product, process, source system, topic, or business object. | +| `Relationship` | Typed link between assets or between an asset and contextual entity. | +| `Actor` | Human, application, automation, service account, or AI agent identity. 
| +| `OperationContext` | Actor, delegated identity, correlation ID, request scope, policy scope, and operation metadata. | +| `PolicyDecision` | Allow, deny, redact, require review, dry-run only, or fail-closed result. | +| `AuditEvent` | Material operation record with actor, target, operation, outcome, correlation ID, policy context. | +| `IngestionJob` | Observable ingestion request and status. | +| `TransformationRun` | Traceable operation over assets producing derived artifacts. | +| `WorkflowTemplate` | Reusable workflow definition with steps, dependencies, inputs, outputs, and policy. | +| `WorkflowRun` | Executed workflow instance and step state. | +| `ExportPackage` | Governed export with manifest, integrity data, and selected records. | + +The old `Artifact` vocabulary can map to `KnowledgeAsset` and +`AssetRepresentation`. The old `Collection` vocabulary remains useful as an +organizational container, but assets should eventually support multiple +collections/scopes where needed. + +## Application Services + +Application services coordinate domain rules and ports. They should be thin but +not anemic; this is where operation ordering, policy checks, audit emission, and +repository updates meet. + +Every material service method should follow this pattern: + +```text +validate input +resolve actor and operation context +load required state +authorize through policy port +perform deterministic domain change or submit job +persist changes +emit audit and events +return typed result or structured error +``` + +Suggested service boundaries: + +- `AssetService`: create, retrieve, update, retire, delete request, metadata, + classification, versioning, relationship changes. +- `IngestionService`: submit ingestion, run extraction, validate normalized + output, quarantine failures, reconcile re-ingestion. +- `RetrievalService`: query, text search, filters, context graph retrieval, + snippets, permission filtering, feedback. 
+- `TransformationService`: operation registry, transformation runs, derived + artifacts, lineage, review requirements. +- `WorkflowService`: workflow templates, run execution, retries, cancel, + resume, exception queues, human tasks. +- `AgentService`: bounded agent operations, context packages, dry runs, review + gates, agent audit. +- `ExportService`: package selection, manifest, integrity validation, + permission-aware export. + +## Ports And Adapters + +Ports are stable interfaces owned by the engine. Adapters are replaceable +implementations. + +Required MVP ports: + +- Repository port for assets, representations, metadata, relationships, + versions, runs, audit events, and exports. +- Object/content store port for source, normalized, and derived content payloads. +- Search index port for lexical search and later semantic/hybrid retrieval. +- Extractor port for format-specific normalization. +- Connector port for source systems. +- Policy decision port for authorization and review requirements. +- Event publisher port for observability, webhooks, and integration. +- AI/model port for provider-neutral summarization, classification, extraction, + embedding, or generation when enabled. + +Adapter rules: + +- `markitect-tool` is an adapter for markdown syntax and context-package + interoperability. +- `llm-connect` or equivalent is an adapter for LLM providers. +- `phase-memory` is an adjacent memory runtime; this engine may exchange opaque + memory references or context packages but should not implement memory phases. +- SQLite is an MVP repository adapter, not the domain model. +- Semantic/vector search is an optional retrieval adapter, not the definition of + retrieval. + +## Persistence Blueprint + +Use SQLite first for local-first durability and tests that prove state survives +repository re-instantiation. 
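The durability test described above can be sketched as follows. This is a minimal illustration, not the engine's repository: the `SqliteAssetRepository` class, its `save`/`get` methods, and the single `assets` table are hypothetical names chosen for the example, and only the standard-library `sqlite3` module is assumed.

```python
import json
import os
import sqlite3
import tempfile
from typing import Optional


class SqliteAssetRepository:
    """Hypothetical repository adapter; the schema is illustrative only."""

    def __init__(self, path: str) -> None:
        self._conn = sqlite3.connect(path)
        # Relational columns for identifiers and status, JSON text for flexible metadata.
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS assets ("
            "asset_id TEXT PRIMARY KEY, status TEXT NOT NULL, metadata TEXT NOT NULL)"
        )
        self._conn.commit()

    def save(self, asset_id: str, status: str, metadata: dict) -> None:
        self._conn.execute(
            "INSERT OR REPLACE INTO assets (asset_id, status, metadata) VALUES (?, ?, ?)",
            (asset_id, status, json.dumps(metadata, sort_keys=True)),
        )
        self._conn.commit()

    def get(self, asset_id: str) -> Optional[dict]:
        row = self._conn.execute(
            "SELECT asset_id, status, metadata FROM assets WHERE asset_id = ?",
            (asset_id,),
        ).fetchone()
        if row is None:
            return None
        return {"asset_id": row[0], "status": row[1], "metadata": json.loads(row[2])}

    def close(self) -> None:
        self._conn.close()


def state_survives_reinstantiation() -> bool:
    """Write through one repository instance, read back through a fresh one."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "engine.db")
        first = SqliteAssetRepository(path)
        first.save("asset-1", "active", {"title": "Example"})
        first.close()

        second = SqliteAssetRepository(path)  # new instance, same file
        loaded = second.get("asset-1")
        missing = second.get("asset-unknown")
        second.close()
    expected = {"asset_id": "asset-1", "status": "active", "metadata": {"title": "Example"}}
    return loaded == expected and missing is None
```

A pytest case would likely just assert `state_survives_reinstantiation()`. The same test body could run against an in-memory adapter to show which guarantees are specific to the durable backend.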
+ +Core tables should map to stable domain concepts: + +- assets, +- source references, +- representations, +- metadata records, +- classifications, +- contextual entities, +- relationships, +- versions, +- change records, +- actors, +- policy assignments or policy references, +- audit events, +- ingestion jobs, +- transformation runs, +- workflow templates, +- workflow runs and step runs, +- export packages and manifests. + +Recommended storage style: + +- Relational columns for identifiers, types, status, timestamps, digests, + foreign keys, and lifecycle fields. +- JSON columns for flexible metadata, extractor details, policy context, and + adapter-specific payloads. +- Separate content/object references for large source, normalized, or derived + payloads. +- Append-only audit events and change records. +- Deterministic ordering fields for pagination and tests. + +Do not store permission-sensitive content in search indexes unless the retrieval +layer can enforce permissions before exposing results. + +## Retrieval Blueprint + +MVP retrieval should be useful before semantic search: + +1. Retrieve by stable asset ID. +2. Filter by metadata, classification, lifecycle, source, collection, and time. +3. Search normalized text lexically. +4. Retrieve by relationship and contextual entity. +5. Return source-grounded snippets and explanation data. +6. Enforce permissions before returning content, snippets, or relationship data. +7. Capture feedback and quality signals. + +Later retrieval can add: + +- semantic/vector search, +- hybrid ranking, +- facets and aggregations, +- grounded answer packages, +- federated external-source retrieval. + +The retrieval contract should not expose backend-specific ranking internals as +stable API. + +## Workflow And Transformation Blueprint + +Transformations and workflows should share a common run model. 
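One way a shared run model might look is sketched below. The state names follow the operation states this blueprint lists; the `RunRecord` fields and the transition table are assumptions made for illustration and are not part of the blueprint.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class RunState(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    WAITING = "waiting"
    COMPLETED = "completed"
    FAILED = "failed"
    PARTIALLY_COMPLETED = "partially_completed"
    RETRIED = "retried"
    CANCELED = "canceled"
    QUARANTINED = "quarantined"
    REVIEW_REQUIRED = "review_required"


# Assumed transition table; a real engine may allow a different set of moves.
ALLOWED_TRANSITIONS = {
    RunState.QUEUED: frozenset({RunState.RUNNING, RunState.CANCELED}),
    RunState.RUNNING: frozenset({
        RunState.WAITING, RunState.COMPLETED, RunState.FAILED,
        RunState.PARTIALLY_COMPLETED, RunState.CANCELED,
        RunState.QUARANTINED, RunState.REVIEW_REQUIRED,
    }),
    RunState.WAITING: frozenset({RunState.RUNNING, RunState.CANCELED}),
    RunState.REVIEW_REQUIRED: frozenset(
        {RunState.RUNNING, RunState.COMPLETED, RunState.CANCELED}
    ),
    RunState.FAILED: frozenset({RunState.RETRIED}),
    RunState.RETRIED: frozenset({RunState.RUNNING}),
}


@dataclass
class RunRecord:
    """Hypothetical record shared by transformation runs and workflow runs."""

    run_id: str
    kind: str  # "transformation" or "workflow"
    actor_id: str
    correlation_id: str
    state: RunState = RunState.QUEUED
    history: List[RunState] = field(default_factory=list)

    def transition(self, new_state: RunState) -> None:
        # States without an entry in the table are treated as terminal.
        allowed = ALLOWED_TRANSITIONS.get(self.state, frozenset())
        if new_state not in allowed:
            raise ValueError(
                f"cannot move run {self.run_id} from {self.state.value} to {new_state.value}"
            )
        self.history.append(self.state)
        self.state = new_state
```

Keeping the state machine in one place means a later queue or external workflow-engine adapter only has to map its own statuses onto `RunState`, rather than each run type defining its own lifecycle.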
+ +Transformation: + +```text +source assets + versions + parameters + actor + policy context + -> transformation run + -> derived artifact representation + -> lineage + audit + event +``` + +Workflow: + +```text +template + inputs + actor + trigger + -> workflow run + -> step runs + -> assets / metadata / relationships / derived artifacts / review tasks + -> audit + events + metrics +``` + +MVP execution can be embedded and synchronous/asynchronous-lite. The contracts +should still allow later replacement with a queue or external workflow engine. + +Operation states should include queued, running, waiting, completed, failed, +partially completed, retried, canceled, quarantined, and review required where +applicable. + +## Policy, Governance, And Audit Blueprint + +Policy is part of the core operating model, not a UI feature. + +Policy inputs: + +- actor and delegated actor, +- role and group membership, +- operation type, +- source-system permission context, +- sensitivity, +- lifecycle state, +- review state, +- asset policy, +- workflow state, +- requested output or export scope. + +Policy outcomes: + +- allow, +- deny, +- redact, +- require review, +- dry-run only, +- fail closed. + +Audit should record material operations: + +- asset creation and updates, +- ingestion, +- metadata and classification changes, +- relationship changes, +- permission or policy changes, +- retrieval/query where configured, +- transformation runs, +- workflow actions, +- export, +- agent operations, +- administrative recovery actions. + +## Agent-Safe Operation Blueprint + +Agents are actors with explicit scope. They must not receive implicit privileged +access. 
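The idea that agents hold only explicit scope can be sketched with a small policy function. The outcome names mirror the policy outcomes listed in this blueprint; the `Actor` shape, the `decide` function, and the rule that agent-requested mutating operations go to review are assumptions for illustration only.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class PolicyOutcome(Enum):
    ALLOW = "allow"
    DENY = "deny"
    REDACT = "redact"
    REQUIRE_REVIEW = "require_review"
    DRY_RUN_ONLY = "dry_run_only"
    FAIL_CLOSED = "fail_closed"


@dataclass(frozen=True)
class Actor:
    """Hypothetical actor shape; scopes are granted explicitly, never implied."""

    actor_id: str
    kind: str  # e.g. "human", "service_account", "ai_agent"
    scopes: FrozenSet[str] = frozenset()


# Assumed set of operations that change state; not defined by the blueprint.
MUTATING_OPERATIONS = frozenset({"propose_metadata_enrichment", "request_transformation"})


def decide(actor: Actor, operation: str) -> PolicyOutcome:
    """Return a policy outcome for one operation request."""
    if not actor.actor_id:
        # Without an identifiable actor there is nothing to authorize against.
        return PolicyOutcome.FAIL_CLOSED
    if operation not in actor.scopes:
        # No implicit privilege: anything outside the granted scope is denied.
        return PolicyOutcome.DENY
    if actor.kind == "ai_agent" and operation in MUTATING_OPERATIONS:
        # Illustrative rule: agent-initiated changes pass through a review gate.
        return PolicyOutcome.REQUIRE_REVIEW
    return PolicyOutcome.ALLOW
```

For example, an agent granted only `search_assets` would receive `DENY` for `inspect_asset`, even though a human actor with a broader scope might be allowed.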
+ +Agent operations should be listed in a bounded catalog: + +- inspect asset, +- search assets, +- retrieve permitted snippets, +- assemble context package, +- propose metadata enrichment, +- propose classification, +- request transformation, +- invoke workflow, +- submit review result, +- dry-run change, +- report generated output. + +Each operation declares: + +- input schema, +- output schema, +- required permissions, +- policy checks, +- audit behavior, +- review-gate behavior, +- failure modes, +- whether dry-run is supported. + +Context packages should contain selected assets, snippets, metadata, +relationships, provenance, task instructions, and policy constraints. They +should be inspectable and bounded; they are not a back door to unrestricted +repository access. + +## Service API Blueprint + +The FastAPI service should be an adapter over application services. + +Endpoint groups: + +- `/v1/assets` +- `/v1/metadata` +- `/v1/relationships` +- `/v1/ingestion/jobs` +- `/v1/retrieval/query` +- `/v1/transformations` +- `/v1/workflows` +- `/v1/audit` +- `/v1/policies` +- `/v1/agents/operations` +- `/v1/context-packages` +- `/v1/exports` +- `/v1/admin` +- `/health`, `/ready`, `/version` + +API DTOs may differ from domain objects. Keep mapping explicit so the domain can +evolve without leaking internal storage shape. + +## Observability And Export Blueprint + +Observability must cover both system operation and product quality. + +Operational signals: + +- ingestion throughput, +- source-update-to-index latency, +- query latency, +- API latency, +- workflow completion rate, +- job failure rate, +- queue age, +- storage/index health, +- policy failures, +- audit completeness. + +Quality signals: + +- retrieval precision hooks, +- zero-result rate, +- low-confidence result rate, +- citation precision, +- unsupported-claim rate where AI adapters are used, +- manual correction rate, +- review turnaround time. 
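Two of the quality signals above, zero-result rate and low-confidence result rate, could be derived from recorded queries roughly as follows. The `QueryRecord` shape and the `0.3` score threshold are hypothetical; the blueprint does not define how confidence is scored.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class QueryRecord:
    """Hypothetical per-query observation captured by the retrieval service."""

    query_id: str
    result_count: int
    top_score: Optional[float]  # None when the query returned nothing


def quality_signals(
    records: Iterable[QueryRecord], low_confidence_threshold: float = 0.3
) -> Dict[str, float]:
    """Compute rates over all recorded queries; both rates are 0.0 for no data."""
    items = list(records)
    total = len(items)
    if total == 0:
        return {"zero_result_rate": 0.0, "low_confidence_rate": 0.0}
    zero = sum(1 for r in items if r.result_count == 0)
    low = sum(
        1
        for r in items
        if r.result_count > 0
        and r.top_score is not None
        and r.top_score < low_confidence_threshold
    )
    return {"zero_result_rate": zero / total, "low_confidence_rate": low / total}
```

The threshold is passed as a parameter so that the stable contract does not depend on any one backend's ranking scale.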
+ +Export packages should include: + +- selected assets, +- source and normalized representations where policy permits, +- metadata, +- relationships, +- provenance, +- versions, +- audit references, +- derived artifacts, +- manifest, +- schema version, +- checksums, +- actor and policy context. + +## Implementation Sequence + +1. Finish `KONT-WP-0004` by turning this blueprint into concrete ADRs and + module migration decisions. +2. Build `KONT-WP-0005` first as the governed asset registry foundation. +3. Add ingestion jobs and source/format adapters in `KONT-WP-0006`. +4. Build permission-aware retrieval and context graph behavior in + `KONT-WP-0007`. +5. Add transformations, derived artifacts, and workflow jobs in + `KONT-WP-0008`. +6. Expose service and agent-safe APIs in `KONT-WP-0009`. +7. Add observability, export, and enterprise-readiness surfaces in + `KONT-WP-0010`. + +## Review Checklist + +Use this checklist before accepting significant implementation changes: + +- Does the change preserve stable asset identity? +- Does it distinguish source, normalized, and derived representations? +- Does every material operation have actor context? +- Are permission checks applied before content exposure or mutation? +- Are audit events emitted or explicitly deferred? +- Are errors structured and traceable with correlation IDs? +- Does the code depend inward on domain contracts rather than outward on + infrastructure? +- Is the extension point a port owned by the engine? +- Can the feature work with in-memory tests and a durable backend? +- Can an agent use the feature only through explicit bounded operations? 
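The checklist item about inward dependencies could be partly automated. The sketch below scans Python source text for imports of infrastructure packages using the standard-library `ast` module. The function name and the forbidden list are assumptions: `fastapi` and `sqlite3` reflect the blueprint's rule for the domain core, while `requests` and `httpx` stand in as examples of HTTP clients.

```python
import ast
from typing import FrozenSet, List

# Assumed list of packages the domain core should not import directly.
FORBIDDEN_ROOTS: FrozenSet[str] = frozenset({"fastapi", "sqlite3", "requests", "httpx"})


def forbidden_imports(source: str, forbidden: FrozenSet[str] = FORBIDDEN_ROOTS) -> List[str]:
    """Return the sorted top-level forbidden packages imported by `source`."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Relative imports (level > 0) stay inside the package and are ignored.
            names = [node.module] if node.module and node.level == 0 else []
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root in forbidden:
                found.add(root)
    return sorted(found)
```

A test suite could apply this to every file under a core package directory and fail when the returned list is non-empty; which directories count as "core" is left to the workplans.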
diff --git a/docs/knowledge-operations-roadmap.md b/docs/knowledge-operations-roadmap.md index 9248df4..db7a04d 100644 --- a/docs/knowledge-operations-roadmap.md +++ b/docs/knowledge-operations-roadmap.md @@ -44,6 +44,9 @@ The strongest implementation wedge is: - Treat identity, provenance, permission checks, audit, and structured errors as P0 infrastructure, not enterprise-only additions. +- Use `docs/architecture-blueprint.md` as the implementation shape for domain + core, application services, ports, adapters, persistence, retrieval, + workflow, service APIs, and agent-safe operation. - Separate source, normalized, and derived representations. - Keep transformations traceable to inputs, versions, parameters, actor, policy, and output artifacts. @@ -57,7 +60,7 @@ The strongest implementation wedge is: | Workplan | Role | Primary Coverage | | --- | --- | --- | -| `KONT-WP-0004` | Architecture rebase | Resolve open product/architecture decisions and publish V0.2 traceability. | +| `KONT-WP-0004` | Architecture rebase | Establish the blueprint, resolve open product/architecture decisions, and publish V0.2 traceability. | | `KONT-WP-0005` | Asset registry core | Stable identity, source/normalized/derived forms, metadata, permissions, audit, durable state. | | `KONT-WP-0006` | Ingestion core | Jobs, connectors, extractors, local files, markdown, PDFs, office docs, datasets, normalization. | | `KONT-WP-0007` | Retrieval core | Query contracts, lexical search, filters, context graph retrieval, permission-aware results, snippets, KPIs. | diff --git a/workplans/KONT-WP-0004-knowledge-operations-architecture.md b/workplans/KONT-WP-0004-knowledge-operations-architecture.md index d24ed02..7b1552a 100644 --- a/workplans/KONT-WP-0004-knowledge-operations-architecture.md +++ b/workplans/KONT-WP-0004-knowledge-operations-architecture.md @@ -31,6 +31,8 @@ workflow state, exportability, and agent-safe operation from the start. ## Outputs - Updated scope and roadmap documentation. 
+- `docs/architecture-blueprint.md` as the architecture baseline for the V0.2 + implementation sequence. - Architecture decision notes for the P0 capability baseline. - Traceability from PRD/FRS V0.2 requirements to implementation workplans. - Revised implementation sequence for `KONT-WP-0005` through `KONT-WP-0010`. @@ -188,12 +190,17 @@ Acceptance: - `SCOPE.md` reflects the V0.2 knowledge operations vision. - `docs/knowledge-operations-roadmap.md` maps PRD/FRS areas to workplans. +- `docs/architecture-blueprint.md` defines the implementation shape and review + checklist. - `README.md` points to the new research and roadmap materials. ## Definition Of Done - Architecture docs clearly distinguish engine, application, connector, provider, and domain-package responsibilities. +- Architecture docs define domain core, application service, port, adapter, + persistence, retrieval, workflow, policy, audit, service API, export, and + agent-safe operation boundaries. - Workplans `KONT-WP-0005` through `KONT-WP-0010` exist and are linked to State Hub. - `python3 -m pytest` passes. diff --git a/workplans/KONT-WP-0005-asset-registry-governance-state.md b/workplans/KONT-WP-0005-asset-registry-governance-state.md index 4f20536..da398ca 100644 --- a/workplans/KONT-WP-0005-asset-registry-governance-state.md +++ b/workplans/KONT-WP-0005-asset-registry-governance-state.md @@ -30,6 +30,13 @@ FR-140 to FR-145, FR-240 to FR-245. Supporting: FR-180 to FR-182, FR-200 to FR-201. +## Architecture Constraint + +Implement this slice through the domain core, application services, repository +ports, policy port, audit port, and SQLite/in-memory adapters described in +`docs/architecture-blueprint.md`. The asset registry must not depend on HTTP, +source connectors, document extractors, search backends, or AI providers. 
+ ## G5.1 - Implement stable asset identity and source references ```task @@ -170,4 +177,6 @@ Acceptance: - Asset lifecycle tests cover create, retrieve, update, retire, delete request, metadata changes, permission checks, audit events, and durable reload. - New models map to the V0.2 FRS vocabulary. +- The implemented package shape follows `docs/architecture-blueprint.md` or + documents any deliberate deviation. - `python3 -m pytest` passes. diff --git a/workplans/KONT-WP-0006-multi-format-ingestion-normalization.md b/workplans/KONT-WP-0006-multi-format-ingestion-normalization.md index fc37cbd..70514b4 100644 --- a/workplans/KONT-WP-0006-multi-format-ingestion-normalization.md +++ b/workplans/KONT-WP-0006-multi-format-ingestion-normalization.md @@ -30,6 +30,13 @@ Primary: FR-020 to FR-030. Supporting: FR-001 to FR-008, FR-022 to FR-028, FR-200 to FR-202, FR-240 to FR-244. +## Architecture Constraint + +Implement ingestion through connector and extractor ports described in +`docs/architecture-blueprint.md`. Format-specific parsing, local filesystem +access, `markitect-tool`, PDF/document libraries, and dataset readers must live +behind adapters, not in the domain core. + ## I6.1 - Implement ingestion job model status and retry surface ```task @@ -168,4 +175,5 @@ Acceptance: - Local file, text, markdown, PDF/document placeholder, and dataset ingestion scenarios are covered by tests. - Job status and provenance are inspectable through programmatic APIs. +- Connector and extractor boundaries follow `docs/architecture-blueprint.md`. - `python3 -m pytest` passes. diff --git a/workplans/KONT-WP-0007-governed-retrieval-context-graph.md b/workplans/KONT-WP-0007-governed-retrieval-context-graph.md index e757811..d0eb1ac 100644 --- a/workplans/KONT-WP-0007-governed-retrieval-context-graph.md +++ b/workplans/KONT-WP-0007-governed-retrieval-context-graph.md @@ -29,6 +29,13 @@ Primary: FR-040 to FR-050 and FR-060 to FR-071. 
Supporting: FR-120 to FR-126, FR-143 to FR-146, FR-163, FR-200 to FR-204. +## Architecture Constraint + +Implement retrieval through retrieval services, search ports, repository ports, +and policy checks described in `docs/architecture-blueprint.md`. Search indexes +and ranking backends are adapters; they must not define the stable query or +result contracts. + ## R7.1 - Implement query contracts pagination sorting and result envelopes ```task @@ -167,4 +174,6 @@ Acceptance: - Retrieval tests cover text, metadata, lifecycle, relationship, contextual entity, pagination, permission, snippet, and feedback behavior. - Retrieval does not bypass policy or source provenance. +- Search, relationship, and context retrieval contracts follow + `docs/architecture-blueprint.md`. - `python3 -m pytest` passes. diff --git a/workplans/KONT-WP-0008-transformations-workflow-jobs.md b/workplans/KONT-WP-0008-transformations-workflow-jobs.md index 8313a9c..641b8eb 100644 --- a/workplans/KONT-WP-0008-transformations-workflow-jobs.md +++ b/workplans/KONT-WP-0008-transformations-workflow-jobs.md @@ -30,6 +30,13 @@ Primary: FR-080 to FR-090 and FR-100 to FR-110. Supporting: FR-083 to FR-085, FR-106, FR-144 to FR-145, FR-165, FR-200 to FR-202. +## Architecture Constraint + +Implement transformations and workflows through operation registries, workflow +services, repository ports, event ports, policy checks, and audit events +described in `docs/architecture-blueprint.md`. Execution may start embedded, +but contracts must allow later queue or workflow-engine adapters. + ## O8.1 - Implement transformation operation registry ```task @@ -167,4 +174,6 @@ Acceptance: - Transformations and workflows produce inspectable run records and audit events. - Derived artifacts are persistent, governed, and lineage-linked. +- Transformation and workflow run models follow + `docs/architecture-blueprint.md`. - `python3 -m pytest` passes. 
diff --git a/workplans/KONT-WP-0009-service-api-agent-safe-operation.md b/workplans/KONT-WP-0009-service-api-agent-safe-operation.md index 4cd8e9d..ad3347d 100644 --- a/workplans/KONT-WP-0009-service-api-agent-safe-operation.md +++ b/workplans/KONT-WP-0009-service-api-agent-safe-operation.md @@ -31,6 +31,13 @@ Primary: FR-160 to FR-169 and FR-180 to FR-188. Supporting: FR-060 to FR-066, FR-080 to FR-085, FR-100 to FR-106, FR-120 to FR-126, FR-200 to FR-202, FR-240 to FR-245. +## Architecture Constraint + +Implement the service API as an adapter over application services, following +`docs/architecture-blueprint.md`. HTTP routes must not own domain behavior. +Agent operations must use the bounded operation catalog, policy checks, audit +events, dry-run behavior, and review gates described in the blueprint. + ## S9.1 - Implement versioned FastAPI service skeleton and health contracts ```task @@ -169,4 +176,6 @@ Acceptance: - The service API exposes the MVP operation surface without requiring UI. - Agent-safe operations are explicit, bounded, permissioned, auditable, and reviewable. +- API routes remain adapters over the core contracts described in + `docs/architecture-blueprint.md`. - `python3 -m pytest` passes. diff --git a/workplans/KONT-WP-0010-observability-export-enterprise-readiness.md b/workplans/KONT-WP-0010-observability-export-enterprise-readiness.md index 626cb13..9ab1aaa 100644 --- a/workplans/KONT-WP-0010-observability-export-enterprise-readiness.md +++ b/workplans/KONT-WP-0010-observability-export-enterprise-readiness.md @@ -31,6 +31,13 @@ Primary: FR-200 to FR-207 and FR-220 to FR-225. Supporting: FR-183 to FR-188, FR-127 to FR-132, FR-070, FR-166 to FR-168, FR-240 to FR-245. +## Architecture Constraint + +Implement observability, export, events, webhooks, and recovery through the +ports, services, audit model, and export package model described in +`docs/architecture-blueprint.md`. 
Export and observability must preserve policy +checks and must not require direct storage access. + ## E10.1 - Expose operational metrics events and job inspection ```task @@ -176,4 +183,6 @@ Acceptance: - Operators can inspect, diagnose, recover, export, and evaluate MVP engine behavior through supported surfaces. - Export packages preserve enough context for inspection and migration. +- Observability, events, recovery, and export follow + `docs/architecture-blueprint.md`. - `python3 -m pytest` passes.