markitect-tool/docs/research-lab-cache-backend-research.md

# Research Lab: Sophisticated Cache Backends

Date: 2026-05-03

## Purpose

This research note explores how `markitect-tool` can keep its slim,
markdown-native core while allowing sophisticated optional backends for cached
ASTs, structured indexes, multiple query paradigms, agent working memory, and
access-controlled knowledge systems.

The goal is not to rebuild `markitect-main` wholesale. The goal is to preserve
the useful insight behind it: once Markdown has been parsed into a trustworthy
structure, many higher-value operations become possible if that structure can
be cached, indexed, queried, reactivated, and governed.

## Research Signals

### Content Addressing And Reproducibility

Git's object model is a practical reference for content-addressed storage:
content is written to an object database and retrieved by a hash-derived key.
Bazel's remote cache similarly pairs an action cache (action hash to result
metadata) with a content-addressable store of output files, so work can be
reused when inputs are unchanged.

Relevance:

- Parse results should be keyed by content hash, parser version, and options.
- Derived indexes should declare their input snapshots and invalidation rules.
- Reproducible context packages need stable object identities.
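
A minimal sketch of the first point, assuming SHA-256 and a canonical JSON
envelope. The function name `parse_cache_key` and the option names are
hypothetical and not part of `markitect-tool`:

```python
import hashlib
import json


def parse_cache_key(content: bytes, parser_version: str, options: dict) -> str:
    """Derive a stable cache key from content, parser version, and options."""
    content_hash = hashlib.sha256(content).hexdigest()
    # Canonical JSON (sorted keys, fixed separators) so that the order in
    # which options were supplied does not change the key.
    envelope = json.dumps(
        {"content": content_hash, "parser": parser_version, "options": options},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(envelope.encode("utf-8")).hexdigest()


# Hypothetical option names, used only for illustration.
key_a = parse_cache_key(b"# Title\n", "1.0.0", {"gfm": True, "front_matter": False})
key_b = parse_cache_key(b"# Title\n", "1.0.0", {"front_matter": False, "gfm": True})
key_c = parse_cache_key(b"# Title\n", "1.1.0", {"gfm": True, "front_matter": False})

assert key_a == key_b  # option order does not affect the key
assert key_a != key_c  # a parser upgrade invalidates the cached parse
```

Folding the parser version into the key means an upgrade never serves a stale
parse; old entries simply stop being addressed and can be garbage-collected
later.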

Sources:

- https://git-scm.com/book/en/v2/Git-Internals-Git-Objects
- https://docs.bazel.build/versions/main/remote-caching.html

### Structured Query And AST Introspection

JSONPath is now standardized as RFC 9535, which defines a syntax for selecting
and extracting values from JSON and includes security considerations covering
implementation behavior and query construction. A stable specification makes
it a good optional backend for power users who need raw access to the full
parsed structure.

SQLite JSON and FTS5 provide a pragmatic local storage/query foundation. FTS5
supports full-text search, relevance ranking, phrase/prefix/NEAR queries, and
external-content tables. These features map well to Markdown sections and
blocks while keeping local-first operation.

Relevance:

- Keep the current simple selector API as the common surface.
- Add JSONPath over `Document.to_dict()` as an optional advanced adapter.
- Add SQLite as the first local persistent index backend.
- Keep AST introspection as a debugging and research-lab capability, not as
  the default user interface.
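
As a rough illustration of the SQLite point, the following sketch indexes
sections in an FTS5 table and runs a prefix query. It assumes the local SQLite
build ships with FTS5 enabled; the table and column names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per Markdown section; `path` is stored but not tokenized.
conn.execute(
    "CREATE VIRTUAL TABLE sections USING fts5(path UNINDEXED, heading, body)"
)
conn.executemany(
    "INSERT INTO sections (path, heading, body) VALUES (?, ?, ?)",
    [
        ("docs/cache.md", "Purpose", "Cached ASTs and structured indexes."),
        ("docs/cache.md", "Access Control", "Policy checks before results are returned."),
    ],
)

# Prefix query, ordered by FTS5's built-in relevance rank.
rows = conn.execute(
    "SELECT path, heading FROM sections WHERE sections MATCH ? ORDER BY rank",
    ("polic*",),
).fetchall()
```

Here `rows` holds the path and heading of every section whose heading or body
contains a token starting with `polic`.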

Sources:

- https://www.rfc-editor.org/rfc/rfc9535.html
- https://www.sqlite.org/json1.html
- https://www.sqlite.org/fts5.html

### Columnar And Vector Backends

Apache Arrow defines a language-independent columnar memory format. DuckDB is
strong for local analytical SQL over structured data. Vector databases such as
Qdrant, LanceDB, and pgvector provide semantic retrieval primitives.

Relevance:

- The core should not depend on any vector database.
- Index backends should advertise capabilities: text search, SQL, JSONPath,
  vector search, hybrid retrieval, analytical scans.
- Vector indexes should store provenance back to document, section, and content
  hash, not just opaque chunks.
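
One possible shape for capability declaration and chunk provenance, sketched
with standard-library types only. All class and field names here are
assumptions for illustration, not an existing Markitect API:

```python
from dataclasses import dataclass
from enum import Flag, auto


class Capability(Flag):
    TEXT_SEARCH = auto()
    SQL = auto()
    JSONPATH = auto()
    VECTOR_SEARCH = auto()
    HYBRID_RETRIEVAL = auto()
    ANALYTICAL_SCAN = auto()


@dataclass(frozen=True)
class BackendInfo:
    name: str
    capabilities: Capability

    def supports(self, needed: Capability) -> bool:
        # True only if every requested capability is declared.
        return (self.capabilities & needed) == needed


@dataclass(frozen=True)
class IndexedChunk:
    """A vector entry that keeps its link back to the source."""
    embedding: tuple[float, ...]
    document_path: str
    section_id: str
    content_hash: str


sqlite_backend = BackendInfo("sqlite", Capability.TEXT_SEARCH | Capability.SQL)
assert sqlite_backend.supports(Capability.TEXT_SEARCH | Capability.SQL)
assert not sqlite_backend.supports(Capability.VECTOR_SEARCH)
```

A caller can then pick a backend by asking for the capabilities a query needs
instead of hard-coding a database name.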

Sources:

- https://arrow.apache.org/docs/format/Columnar.html
- https://duckdb.org/docs/stable/data/json/overview
- https://qdrant.tech/documentation/manage-data/collections/
- https://docs.lancedb.com/
- https://github.com/pgvector/pgvector

### Agent Context And Working Memory

The Model Context Protocol gives a useful integration model: resources provide
context/data, tools execute actions, and roots define filesystem or URI
boundaries. LangChain/LangGraph memory docs distinguish short-term,
thread-scoped memory from long-term, namespace-scoped memory, and further split
long-term memory into semantic, episodic, and procedural forms. The MemGPT
paper frames memory management as an operating-system-like problem for LLMs.

Relevance:

- Markitect context caches should be namespace-scoped and explicitly
  activatable.
- A context package should carry text, structure, provenance, policy, freshness,
  and token-budget metadata.
- Agents should be able to drop and reactivate working context by stable id.
- Memory writes need two modes: hot-path (written during the agent's turn) and
  background (written asynchronously afterwards).
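
The points above could translate into something like the following in-memory
sketch. `ContextPackage` and `ContextRegistry` are hypothetical names, and a
real backend would persist packages instead of holding them in a dict:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ContextPackage:
    package_id: str           # stable id used to drop and reactivate
    namespace: str
    text: str
    provenance: list[str]     # e.g. source paths or content hashes
    policy_labels: list[str]
    token_estimate: int
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


class ContextRegistry:
    """Stores packages by (namespace, id) and tracks which are active."""

    def __init__(self) -> None:
        self._stored = {}
        self._active = set()

    def store(self, package: ContextPackage) -> None:
        self._stored[(package.namespace, package.package_id)] = package

    def activate(self, namespace: str, package_id: str) -> ContextPackage:
        key = (namespace, package_id)
        package = self._stored[key]  # raises KeyError if never stored
        self._active.add(key)
        return package

    def drop(self, namespace: str, package_id: str) -> None:
        # Dropping only deactivates; the stored package stays available.
        self._active.discard((namespace, package_id))

    def active_ids(self, namespace: str) -> list[str]:
        return sorted(pid for ns, pid in self._active if ns == namespace)


registry = ContextRegistry()
registry.store(
    ContextPackage("ctx-001", "project-a", "Cache design notes",
                   ["docs/cache.md"], ["internal"], 120)
)
registry.activate("project-a", "ctx-001")
registry.drop("project-a", "ctx-001")
restored = registry.activate("project-a", "ctx-001")  # reactivate by stable id
```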

Sources:

- https://modelcontextprotocol.io/specification/2025-06-18
- https://docs.langchain.com/oss/python/concepts/memory
- https://developers.llamaindex.ai/python/framework/module_guides/deploying/agents/memory/
- https://arxiv.org/abs/2310.08560

### Provenance, Observability, And Debuggability

W3C PROV provides a vocabulary for entities, activities, agents, and
derivations. OpenTelemetry traces provide spans and attributes for observing
distributed or multi-step operations.

Relevance:

- Cache entries should record what produced them.
- Query results should be explainable: source file, section, content hash,
  index backend, policy decision, and transform chain.
- Agent context packs should be auditable.
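
To make the "explainable result" idea concrete, here is a small sketch of a
record carrying the fields listed above. The class name and the example values
are illustrative assumptions:

```python
import hashlib
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ResultExplanation:
    source_file: str
    section: str
    content_hash: str
    index_backend: str
    policy_decision: str
    transform_chain: tuple[str, ...] = ()


explanation = ResultExplanation(
    source_file="docs/cache.md",
    section="Main Finding",
    content_hash=hashlib.sha256(b"section text").hexdigest(),
    index_backend="sqlite-fts5",
    policy_decision="allow",
    transform_chain=("parse", "index"),
)

# A plain dict is easy to log, attach to a trace span, or store with the result.
record = asdict(explanation)
```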

Sources:

- https://www.w3.org/TR/prov-overview/
- https://opentelemetry.io/docs/concepts/signals/traces/

### Access Control: Fluid To Rigid

Zanzibar demonstrates a relationship-based authorization model at large scale.
OpenFGA and SpiceDB make Zanzibar-style relationship-based access control
available as productized systems. OPA/Rego and Cedar provide policy evaluation
models for attribute-based and rule-based decisions.

Relevance:

- Markitect should support a fluid-to-rigid access-control ladder.
- Local labs can start with labels and trust scopes.
- Secure deployments need policy checks before query results are returned to
  agents or users.
- Policy decisions should be part of the diagnostic and provenance trail.
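
The fluid end of the ladder might look like the label check below, which
filters matches before they leave the backend and records a decision for each
one. This is a sketch under assumed names (`Match`, `Decision`,
`filter_by_labels`); a stricter deployment would delegate the decision to an
external policy engine instead:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    path: str
    text: str
    labels: frozenset[str]


@dataclass(frozen=True)
class Decision:
    path: str
    allowed: bool
    reason: str


def filter_by_labels(matches, granted):
    """Return (allowed matches, decision log) for a caller's granted labels."""
    allowed, decisions = [], []
    for match in matches:
        missing = match.labels - granted
        ok = not missing
        reason = "all labels granted" if ok else f"missing labels: {sorted(missing)}"
        decisions.append(Decision(match.path, ok, reason))
        if ok:
            allowed.append(match)
    return allowed, decisions


matches = [
    Match("docs/public.md", "Overview", frozenset({"public"})),
    Match("docs/secrets.md", "Keys", frozenset({"internal", "restricted"})),
]
allowed, decisions = filter_by_labels(matches, frozenset({"public", "internal"}))
```

Every match produces a decision, including denied ones, so the log can feed
the diagnostic and provenance trail.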

Sources:

- https://www.usenix.org/conference/atc19/presentation/pang
- https://openfga.dev/docs/concepts
- https://www.openpolicyagent.org/docs/policy-language
- https://docs.cedarpolicy.com/

## Main Finding

The optional backend should be a **capability-oriented cache fabric**, not a
single database choice.

The slim core should continue to parse, validate, query, transform, and
generate Markdown without persistent infrastructure. The research-lab backend
should attach through explicit interfaces:

- content-addressed snapshots
- index manifests
- query adapter registry
- memory/context package registry
- access policy gateway
- provenance and trace records
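
As one example of such an interface, a query adapter registry could be
sketched as follows. The registry, the `QueryResult` envelope, and the
document dict shape are hypothetical; the shape is not the real
`Document.to_dict()` output:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class QueryResult:
    """Common envelope returned by every adapter."""
    backend: str
    path: str
    value: str


QueryAdapter = Callable[[dict, str], list[QueryResult]]
_ADAPTERS: dict[str, QueryAdapter] = {}


def register_adapter(name: str):
    def decorator(fn: QueryAdapter) -> QueryAdapter:
        _ADAPTERS[name] = fn
        return fn
    return decorator


@register_adapter("heading")
def heading_adapter(document: dict, query: str) -> list[QueryResult]:
    # Case-insensitive substring match on section headings.
    return [
        QueryResult("heading", section["path"], section["heading"])
        for section in document["sections"]
        if query.lower() in section["heading"].lower()
    ]


def run_query(backend: str, document: dict, query: str) -> list[QueryResult]:
    try:
        adapter = _ADAPTERS[backend]
    except KeyError:
        raise LookupError(f"no query adapter registered for {backend!r}") from None
    return adapter(document, query)


doc = {"sections": [{"path": "docs/cache.md", "heading": "Main Finding"}]}
results = run_query("heading", doc, "finding")
```

Optional backends register themselves on import, so the core never needs to
know which adapters are installed.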

That lets the project support spontaneous one-time tool use and also grow into
high-performance, agentic, security-sensitive knowledge systems.

## Most Promising Use Cases

### UC-RL-001: AST Introspection And JSONPath Backend

Expose raw parsed documents for advanced users:

- `mkt ast show`
- `mkt ast query --backend jsonpath`
- raw token and inline query support
- adapter path from simple selectors to JSONPath where possible

Utility:

- debugging parser behavior
- developing transforms
- power-user structural extraction
- migration path for legacy `markitect-main` AST workflows

### UC-RL-002: Local Persistent Knowledge Index

Build a local cache/index for a repo or document collection:

- content-addressed document snapshots
- SQLite JSON tables for structure
- SQLite FTS5 for section/block text search
- optional DuckDB/Arrow export for analytical work
- incremental refresh based on content hashes
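
Incremental refresh can be reduced to comparing current content hashes against
the hashes recorded at the last index run. A minimal sketch, assuming SHA-256
and a hypothetical `plan_refresh` helper that works on in-memory data:

```python
import hashlib


def plan_refresh(current: dict[str, bytes], indexed: dict[str, str]):
    """Compare current file contents against previously indexed hashes.

    Returns (paths to (re)index, paths to remove, fresh hash map).
    """
    fresh = {path: hashlib.sha256(data).hexdigest() for path, data in current.items()}
    to_index = sorted(p for p, digest in fresh.items() if indexed.get(p) != digest)
    to_remove = sorted(p for p in indexed if p not in fresh)
    return to_index, to_remove, fresh


indexed = {
    "a.md": hashlib.sha256(b"# A\n").hexdigest(),
    "b.md": hashlib.sha256(b"# B\n").hexdigest(),
    "old.md": hashlib.sha256(b"# Old\n").hexdigest(),
}
current = {"a.md": b"# A\n", "b.md": b"# B changed\n", "new.md": b"# New\n"}

to_index, to_remove, fresh = plan_refresh(current, indexed)
```

In this example only `b.md` and `new.md` need parsing again, `old.md` is
dropped from the index, and `a.md` is left untouched.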

Utility:

- fast repeated queries
- search across many Markdown files
- offline/local-first knowledge work
- foundation for batch transforms and generation pipelines

### UC-RL-003: Agent Working Memory Cache

Create activatable context packages for LLM agents:

- namespace-scoped memories
- short-term working sets and long-term caches
- semantic/episodic/procedural memory categories
- drop/reactivate by stable id
- token-budget-aware context assembly
- provenance and freshness metadata
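
Token-budget-aware assembly could start as a simple greedy selection. The
sketch below assumes each package carries a caller-supplied priority and token
estimate; the field names and the greedy strategy are illustrative choices,
not a settled design:

```python
def assemble_context(packages, budget):
    """Greedily pick the highest-priority packages that fit the token budget.

    Returns (chosen package ids in selection order, remaining budget).
    """
    chosen = []
    remaining = budget
    for package in sorted(packages, key=lambda p: p["priority"], reverse=True):
        if package["tokens"] <= remaining:
            chosen.append(package["id"])
            remaining -= package["tokens"]
    return chosen, remaining


packages = [
    {"id": "ctx-a", "priority": 3, "tokens": 500},
    {"id": "ctx-b", "priority": 2, "tokens": 700},
    {"id": "ctx-c", "priority": 1, "tokens": 300},
]
chosen, remaining = assemble_context(packages, budget=1000)
assert chosen == ["ctx-a", "ctx-c"]  # ctx-b does not fit after ctx-a
assert remaining == 200
```

Greedy selection is not optimal in general, but it is predictable and easy to
explain in a provenance record.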

Utility:

- efficient agent work across long projects
- reusable context packs for recurring tasks
- controlled memory updates and recall
- bridge from Markitect documents to agent infrastructure

### UC-RL-004: Access-Controlled Knowledge Gateway

Add policy enforcement to cached retrieval:

- labels/trust zones for local use
- ACL/ReBAC/ABAC adapters for stricter systems
- policy-aware query result filtering
- decision logs and diagnostics
- secure context packages for LLM use

Utility:

- enterprise and IT-security use cases
- multi-tenant knowledge bases
- agent access control
- auditable data exposure

## Design Principles

- The core remains infrastructure-free.
- Backends are optional and capability-declared.
- Every cached object is content-addressed or provenance-addressed.
- Query adapters return the same match/result envelope.
- Policy is checked before data leaves a backend boundary.
- Context packages are explicit, droppable, and reactivatable.
- LLM memory is data with provenance, not invisible prompt residue.
- Experimental backends belong behind stable contracts.