# Cache Backend Architecture Blueprint
Date: 2026-05-03
## Purpose
This blueprint defines an optional backend architecture for sophisticated
knowledge systems built on top of `markitect-tool`.
It is a research-lab architecture: powerful enough to support cached ASTs,
advanced query backends, agent memory, and access control, but separated from
the slim core so one-off CLI use stays fast and simple.
## Architectural Boundary
The core package owns:
- Markdown parsing
- document contracts
- simple selectors
- deterministic transforms and generation primitives
- unified diagnostics
The optional backend fabric owns:
- persistent snapshots
- indexes
- advanced query adapters
- memory/context packages
- policy enforcement
- provenance records
- trace and performance metadata
The core must be able to run without the backend fabric.
## Conceptual Layers
```text
Markdown files
  -> Core parser and contract layer
  -> Content-addressed document snapshots
  -> Index fabric
       -> AST/JSON index
       -> full-text index
       -> vector/semantic index
       -> analytical/index export
  -> Query adapter registry
       -> simple selectors
       -> JSONPath
       -> SQL/FTS
       -> vector/hybrid retrieval
  -> Context package registry
       -> activated working sets
       -> memory namespaces
       -> agent-ready context bundles
  -> Access policy gateway
       -> labels/ACL/ReBAC/ABAC
       -> result filtering and denial diagnostics
  -> Provenance and observability
```
## Core Interfaces
### Snapshot Backend
Responsible for durable parsed-document snapshots.
Minimum protocol:
```text
put_document(source_path, content, parse_options) -> snapshot_id
get_snapshot(snapshot_id) -> DocumentSnapshot
resolve_source(source_path) -> latest snapshot_id
diff_snapshot(old_id, new_id) -> SnapshotDiff
```
Snapshot identity should include:
- source content hash
- parser version
- parse options
- contract version when relevant
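
The identity inputs listed above could be combined into a single content-addressed id. The following is a minimal sketch, not the project's implementation: the function name `snapshot_id`, the choice of SHA-256, and the canonical-JSON encoding are all assumptions made for illustration.

```python
import hashlib
import json
from typing import Optional


def snapshot_id(content: bytes, parser_version: str, parse_options: dict,
                contract_version: Optional[str] = None) -> str:
    """Derive an id from every input that can change the parsed result.

    Hypothetical helper: the name and the SHA-256 choice are assumptions.
    """
    content_hash = hashlib.sha256(content).hexdigest()
    # Canonical JSON (sorted keys, fixed separators) keeps the id stable
    # regardless of the order in which parse options were supplied.
    identity = json.dumps(
        {
            "content": content_hash,
            "parser": parser_version,
            "options": parse_options,
            "contract": contract_version,
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(identity.encode("utf-8")).hexdigest()
```

Under this scheme a parser upgrade or a changed parse option yields a new snapshot id even when the source bytes are identical, so stale parses are never mistaken for current ones.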
### Index Backend
Responsible for derived lookup structures.
Minimum protocol:
```text
capabilities() -> IndexCapabilities
build(snapshot_ids, options) -> IndexBuildResult
refresh(changed_snapshots) -> IndexBuildResult
query(request) -> QueryResult
explain(request) -> QueryPlan
```
Capabilities should include:
- `jsonpath`
- `sql`
- `fts`
- `vector`
- `hybrid`
- `inline_tokens`
- `section_graph`
- `policy_pushdown`
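
One way to make capability flags usable is to let callers ask for a backend by the features a query needs. The sketch below assumes a hypothetical shape for `IndexCapabilities` (a validated set of the flag names above) and a hypothetical `pick_backend` helper; neither is defined by this blueprint.

```python
from dataclasses import dataclass

# The capability names listed in this section.
KNOWN_CAPABILITIES = frozenset({
    "jsonpath", "sql", "fts", "vector", "hybrid",
    "inline_tokens", "section_graph", "policy_pushdown",
})


@dataclass(frozen=True)
class IndexCapabilities:
    """Hypothetical shape: a validated set of capability flags."""
    features: frozenset = frozenset()

    def __post_init__(self):
        unknown = set(self.features) - KNOWN_CAPABILITIES
        if unknown:
            raise ValueError(f"unknown capabilities: {sorted(unknown)}")

    def supports(self, *required: str) -> bool:
        # True only when every required flag is advertised.
        return set(required) <= set(self.features)


def pick_backend(backends: dict, *required: str):
    """Return the name of the first backend advertising all required flags."""
    for name, caps in backends.items():
        if caps.supports(*required):
            return name
    return None  # the caller decides how to report "no capable backend"
```

Rejecting unknown flag names at construction time keeps typos in a backend's declaration from silently disabling a feature.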
### Query Adapter
Translates a stable Markitect query request into backend-specific execution.
Minimum protocol:
```text
name
supports(selector_or_query, target) -> bool
execute(document_or_backend, request) -> QueryResult
explain(request) -> QueryExplanation
```
Adapters must return a common result envelope:
- kind
- path
- value
- text
- source location
- snapshot id
- provenance
- policy decision
- backend metadata
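
The envelope fields above map naturally onto a small immutable record. This is a sketch under assumptions: the class names `ResultEnvelope` and `SourceLocation`, the field types, and the line-based location are illustrative and not taken from `markitect-tool`.

```python
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass(frozen=True)
class SourceLocation:
    # Assumed shape: a file path plus an inclusive line range.
    path: str
    start_line: int
    end_line: int


@dataclass(frozen=True)
class ResultEnvelope:
    """Hypothetical common envelope returned by every query adapter."""
    kind: str                      # e.g. "heading", "code_block"
    path: str                      # selector path of the match (assumed meaning)
    value: Any                     # structured value of the matched node
    text: str                      # plain-text rendering
    source_location: SourceLocation
    snapshot_id: str
    policy_decision: str           # required: no implicit "allow" default
    provenance: dict = field(default_factory=dict)
    backend_metadata: dict = field(default_factory=dict)


result = ResultEnvelope(
    kind="heading",
    path="$.sections[0].heading",
    value={"level": 2},
    text="Purpose",
    source_location=SourceLocation("docs/a.md", 3, 3),
    snapshot_id="abc123",
    policy_decision="allow",
)
serialized = asdict(result)  # nested dataclasses become plain dicts
```

Making `policy_decision` a required field means an adapter cannot return a result without stating how policy was applied to it.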
### Context Package Registry
Responsible for agent-ready working memory.
Minimum protocol:
```text
create_package(query_or_manifest, budget, policy) -> context_package_id
activate(package_id, thread_or_workspace) -> activation_id
deactivate(activation_id)
refresh(package_id) -> package_id
explain(package_id) -> ContextPackageReport
```
Context packages should include:
- included source spans
- summary layers
- token estimates
- provenance
- freshness
- policy labels
- retrieval recipe
- cache keys
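
A context package is essentially a budgeted collection of source spans. The sketch below shows one possible way to enforce a token budget and aggregate policy labels; the class names, the fields, and the four-characters-per-token heuristic are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SourceSpan:
    """Hypothetical span record; field names are illustrative."""
    snapshot_id: str
    start_line: int
    end_line: int
    text: str
    labels: frozenset = frozenset()

    @property
    def token_estimate(self) -> int:
        # Crude heuristic (about 4 characters per token); an assumption,
        # not a real tokenizer.
        return max(1, len(self.text) // 4)


@dataclass
class ContextPackage:
    budget: int                              # maximum estimated tokens
    spans: list = field(default_factory=list)

    @property
    def token_estimate(self) -> int:
        return sum(span.token_estimate for span in self.spans)

    @property
    def policy_labels(self) -> frozenset:
        # The package carries the union of the labels of everything it includes.
        return frozenset().union(*(span.labels for span in self.spans))

    def add(self, span: SourceSpan) -> bool:
        """Add a span only if it fits the remaining budget."""
        if self.token_estimate + span.token_estimate > self.budget:
            return False
        self.spans.append(span)
        return True
```

Because each span keeps its snapshot id and line range, a package built this way can list exactly what it includes, which the security notes later require.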
### Access Policy Gateway
Responsible for authorization and redaction before results leave a backend.
Minimum protocol:
```text
authorize(subject, action, object, context) -> PolicyDecision
filter_results(subject, action, results, context) -> FilteredResults
explain_decision(decision_id) -> PolicyExplanation
```
Policy should support a ladder of mechanisms, from simplest to most expressive:
1. Labels and trust zones.
2. File/path ACLs.
3. Relationship-based access control.
4. Attribute/rule-based policies.
5. External authorization services.
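
The first two rungs of the ladder can be sketched as a chain of small checks behind one `authorize` call. Everything here is an assumption made for illustration: the dict shapes for `subject` and `obj`, the rung function names, and the rule that labels may only deny while path ACLs may only grant, with deny as the default.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class PolicyDecision:
    allowed: bool
    rung: str      # which rung of the ladder decided
    reason: str    # human-readable text for denial diagnostics


def label_rung(subject, action, obj, context):
    # Rung 1: deny when the object carries a label the subject is not cleared for.
    missing = set(obj.get("labels", ())) - set(subject.get("clearances", ()))
    if missing:
        return PolicyDecision(False, "labels", f"missing clearance: {sorted(missing)}")
    return None  # no opinion; fall through to the next rung


def path_acl_rung(subject, action, obj, context):
    # Rung 2: grant when a path pattern in the subject's ACL allows the action.
    for pattern, actions in subject.get("acl", {}).items():
        if fnmatch(obj["path"], pattern) and action in actions:
            return PolicyDecision(True, "path-acl", f"matched {pattern}")
    return None


def authorize(subject, action, obj, context=None, rungs=(label_rung, path_acl_rung)):
    """Walk the rungs in order; the first rung with an opinion decides."""
    for rung in rungs:
        decision = rung(subject, action, obj, context or {})
        if decision is not None:
            return decision
    return PolicyDecision(False, "default", "no rung granted access")
```

Later rungs (relationship-based, attribute-based, external services) would slot into the same `rungs` tuple without changing the call site.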
## Suggested Backend Manifest
Backends should register through a Markdown/YAML manifest:
````markdown
# Local SQLite Backend
```yaml markitect-backend
id: local-sqlite-cache
kind: cache-backend
capabilities:
  - snapshots
  - json
  - fts
  - sql
  - provenance
storage:
  engine: sqlite
  path: .markitect/cache/index.sqlite
policy:
  mode: labels
```
````
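
Discovering such a manifest means locating fenced blocks whose info string is `yaml markitect-backend`. The sketch below uses only the standard library and stops at extracting the raw YAML text; the function name is hypothetical, and parsing the body is left to whichever YAML library the project adopts.

```python
import re

# Built indirectly so this example can itself sit inside a Markdown fence.
FENCE = "`" * 3

# Match a fence opened with the info string "yaml markitect-backend" and
# capture everything up to the next line that starts with a closing fence.
MANIFEST_RE = re.compile(
    rf"^{FENCE}yaml markitect-backend[ \t]*\n(.*?)^{FENCE}[ \t]*$",
    re.MULTILINE | re.DOTALL,
)


def extract_backend_manifests(markdown: str) -> list:
    """Return the raw YAML body of every backend manifest block (hypothetical helper)."""
    return [match.group(1) for match in MANIFEST_RE.finditer(markdown)]
```

Plain `yaml` fences without the `markitect-backend` marker are ignored, so ordinary YAML examples in a document are not mistaken for backend registrations.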
## CLI Direction
The first backend CLI should be explicit:
```text
mkt cache init
mkt cache build <path>
mkt cache status
mkt cache query <selector-or-query> --backend <name>
mkt ast show <file>
mkt ast query <file> <jsonpath>
mkt context pack <manifest-or-query>
mkt context activate <package-id>
mkt policy check <subject> <action> <object>
```
Do not hide persistence behind `mkt query`. The user should know when the tool
is querying live files versus a persistent backend.
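
A sketch of how the `mkt cache` command group might be wired with `argparse`, covering only the cache subcommands. The function name is hypothetical, and making `--backend` mandatory is one possible way to keep persistence explicit, not a decision recorded in this blueprint.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser for the `mkt cache ...` commands only."""
    parser = argparse.ArgumentParser(prog="mkt")
    groups = parser.add_subparsers(dest="group", required=True)

    cache = groups.add_parser("cache", help="persistent backend operations")
    commands = cache.add_subparsers(dest="command", required=True)

    commands.add_parser("init")
    commands.add_parser("build").add_argument("path")
    commands.add_parser("status")

    query = commands.add_parser("query")
    query.add_argument("selector_or_query")
    # Assumption: requiring --backend means a cached query never runs implicitly.
    query.add_argument("--backend", required=True)
    return parser


args = build_parser().parse_args(
    ["cache", "query", "$.sections[*]", "--backend", "local-sqlite-cache"]
)
```

Keeping cached queries under `mkt cache query` leaves any plain query command free to operate on live files only.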
## Recommended First Stack
Start with:
- content hashes computed with the Python standard library (`hashlib`)
- SQLite for snapshot metadata, JSON, and FTS5
- JSONPath as an optional extra
- local filesystem cache directory
- simple label policy
- provenance tables
Defer:
- vector search until text/structure cache works
- external authorization engines until local policy model is stable
- MCP server exposure until resources/tools are secure and explainable
- distributed cache until local invalidation is boring
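
The "Start with" items can be sketched using `hashlib` and `sqlite3` alone. This is a deliberately reduced illustration: the table layout and function signatures are assumptions, the caller passes an already-parsed AST instead of parse options, the snapshot id is just the content hash (a real id would also cover parser version and options), and FTS5 and provenance tables are omitted.

```python
import hashlib
import json
import sqlite3

# Assumed schema: immutable snapshots plus a pointer to the latest per source.
SCHEMA = """
CREATE TABLE IF NOT EXISTS snapshots (
    snapshot_id  TEXT PRIMARY KEY,
    content_hash TEXT NOT NULL,
    ast_json     TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS sources (
    source_path  TEXT PRIMARY KEY,
    snapshot_id  TEXT NOT NULL REFERENCES snapshots(snapshot_id)
);
"""


def open_cache(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def put_document(conn, source_path, content, ast):
    content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    snapshot_id = content_hash  # simplification, see the lead-in
    with conn:  # one transaction: snapshot row plus "latest" pointer
        conn.execute(
            "INSERT OR IGNORE INTO snapshots VALUES (?, ?, ?)",
            (snapshot_id, content_hash, json.dumps(ast)),
        )
        conn.execute(
            "INSERT OR REPLACE INTO sources VALUES (?, ?)",
            (source_path, snapshot_id),
        )
    return snapshot_id


def resolve_source(conn, source_path):
    row = conn.execute(
        "SELECT snapshot_id FROM sources WHERE source_path = ?", (source_path,)
    ).fetchone()
    return row[0] if row else None


def get_snapshot(conn, snapshot_id):
    row = conn.execute(
        "SELECT ast_json FROM snapshots WHERE snapshot_id = ?", (snapshot_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None
```

Old snapshots stay readable after a source changes, which is what a later `diff_snapshot` would need.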
## Security Notes
Cached data becomes a new data exposure surface.
Minimum requirements before secure use:
- cache location is explicit
- cache entries record their source path and content hash
- policy mode is visible
- query results report policy filtering
- context packages list what they include
- destructive cache operations require an explicit command
- no backend silently sends document content to a network service
## Architecture Decision
Implement the backend fabric after deterministic transform/composition
primitives are underway, but before serious caching, agent memory, or advanced
query backends. This lets WP-0003 continue while reserving a clean path for the
research-lab track.