Research Lab: Sophisticated Cache Backends

Date: 2026-05-03

Purpose

This research note explores how markitect-tool can keep its slim, markdown-native core while allowing sophisticated optional backends for cached ASTs, structured indexes, multiple query paradigms, agent working memory, and access-controlled knowledge systems.

The goal is not to rebuild markitect-main wholesale. The goal is to preserve the useful insight behind it: once Markdown has been parsed into a trustworthy structure, many higher-value operations become possible if that structure can be cached, indexed, queried, reactivated, and governed.

Research Signals

Content Addressing And Reproducibility

Git's object model is a practical reference for content-addressed storage: content is written to an object database and retrieved by a hash-derived key. Bazel remote caching similarly splits its cache into an action cache, which holds action result metadata, and a content-addressable store of output files, so work can be reused when inputs are unchanged.

Relevance:

  • Parse results should be keyed by content hash, parser version, and options.
  • Derived indexes should declare their input snapshots and invalidation rules.
  • Reproducible context packages need stable object identities.
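
The first bullet can be sketched as a key derivation. This is a minimal illustration, not markitect-tool's actual scheme: the function name `parse_cache_key`, the choice of SHA-256, and the canonical-JSON encoding of options are all assumptions.

```python
import hashlib
import json


def parse_cache_key(content: bytes, parser_version: str, options: dict) -> str:
    """Derive a stable cache key from content, parser version, and options.

    Hypothetical helper: the name and the key layout are assumptions.
    """
    content_hash = hashlib.sha256(content).hexdigest()
    # Canonical JSON so that the order of option keys does not change the key.
    canonical_options = json.dumps(options, sort_keys=True, separators=(",", ":"))
    material = "\n".join([content_hash, parser_version, canonical_options])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


key = parse_cache_key(b"# Title\n", "1.0", {"gfm": True, "front_matter": False})
```

Changing the content, the parser version, or any option yields a different key, so a stale parse result is never looked up by accident.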

Structured Query And AST Introspection

JSONPath is standardized as RFC 9535, which defines selection and extraction over JSON values and includes security considerations covering implementation behavior and query construction. A standardized syntax makes it a good optional backend for power users who need raw access to the full parsed structure.

SQLite JSON and FTS5 provide a pragmatic local storage/query foundation. FTS5 supports full-text search, relevance ranking, phrase/prefix/NEAR queries, and external-content tables. These features map well to Markdown sections and blocks while keeping local-first operation.

Relevance:

  • Keep the current simple selector API as the common surface.
  • Add JSONPath over Document.to_dict() as an optional advanced adapter.
  • Add SQLite as the first local persistent index backend.
  • Keep AST introspection as a debugging and research-lab capability, not as the default user interface.
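
To show how FTS5 maps onto Markdown sections, here is a small sketch using Python's built-in `sqlite3` module. The table name `sections`, its columns, and the sample rows are invented for illustration, and the sketch assumes the local SQLite build includes FTS5 (common, but not guaranteed).

```python
import sqlite3


def build_index(rows):
    """Create an in-memory FTS5 table with one row per Markdown section."""
    conn = sqlite3.connect(":memory:")
    # `path` is stored but not tokenized; heading and body are searchable.
    conn.execute(
        "CREATE VIRTUAL TABLE sections USING fts5(path UNINDEXED, heading, body)"
    )
    conn.executemany(
        "INSERT INTO sections (path, heading, body) VALUES (?, ?, ?)", rows
    )
    return conn


def search(conn, query, limit=5):
    """Run an FTS5 MATCH query, best-ranked sections first."""
    return conn.execute(
        "SELECT path, heading FROM sections "
        "WHERE sections MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    ).fetchall()


# Hypothetical sample data, not taken from a real repository.
SAMPLE_ROWS = [
    ("docs/cache.md", "Content Addressing",
     "Parse results are keyed by content hash and parser version."),
    ("docs/query.md", "JSONPath Adapter",
     "Advanced users can query the raw parsed structure."),
    ("docs/memory.md", "Working Memory",
     "Agents drop and reactivate cached context by stable id."),
]

conn = build_index(SAMPLE_ROWS)
phrase_hits = search(conn, '"content hash"')  # phrase query
prefix_hits = search(conn, "cach*")           # prefix query
```

Storing the path as an unindexed column keeps provenance next to each hit without letting file names influence matching.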

Columnar And Vector Backends

Apache Arrow defines a language-independent columnar memory format. DuckDB is strong for local analytical SQL over structured data. Vector databases such as Qdrant, LanceDB, and pgvector provide semantic retrieval primitives.

Relevance:

  • The core should not depend on any vector database.
  • Index backends should advertise capabilities: text search, SQL, JSONPath, vector search, hybrid retrieval, analytical scans.
  • Vector indexes should store provenance back to document, section, and content hash, not just opaque chunks.
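
Capability advertisement could look roughly like the following. The `Capability` flag names mirror the list above; `BackendInfo`, `REGISTRY`, and `backends_supporting` are hypothetical names, and the registry entries are illustrative rather than a statement of what each product supports in full.

```python
from dataclasses import dataclass
from enum import Flag, auto


class Capability(Flag):
    TEXT_SEARCH = auto()
    SQL = auto()
    JSONPATH = auto()
    VECTOR_SEARCH = auto()
    HYBRID_RETRIEVAL = auto()
    ANALYTICAL_SCAN = auto()


@dataclass(frozen=True)
class BackendInfo:
    name: str
    capabilities: Capability


# Illustrative registry; the capability sets are assumptions for the sketch.
REGISTRY = [
    BackendInfo("sqlite", Capability.TEXT_SEARCH | Capability.SQL),
    BackendInfo("duckdb", Capability.SQL | Capability.ANALYTICAL_SCAN),
    BackendInfo("vector-store", Capability.VECTOR_SEARCH),
]


def backends_supporting(required: Capability, registry=REGISTRY) -> list[str]:
    """Return names of backends that declare every required capability."""
    return [b.name for b in registry if (b.capabilities & required) == required]


sql_backends = backends_supporting(Capability.SQL)
```

A caller asks for what it needs, and the core never has to import a backend it does not use.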

Agent Context And Working Memory

The Model Context Protocol gives a useful integration model: resources provide context/data, tools execute actions, and roots define filesystem or URI boundaries. LangChain/LangGraph memory docs distinguish short-term, thread-scoped memory from long-term, namespace-scoped memory, and further split long-term memory into semantic, episodic, and procedural forms. The MemGPT paper frames memory management as an operating-system-like problem for LLMs.

Relevance:

  • Markitect context caches should be namespace-scoped and explicitly activatable.
  • A context package should carry text, structure, provenance, policy, freshness, and token-budget metadata.
  • Agents should be able to drop and reactivate working context by stable id.
  • Memory writes need two modes: in the hot path (synchronously, during the agent's turn) and in the background (asynchronously, after it).
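
A context package with the metadata listed above, plus drop and reactivate by stable id, might be modeled as follows. Every name here (`ContextPackage`, `ContextRegistry`, and the field names) is a hypothetical sketch, not an existing markitect-tool API.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ContextPackage:
    package_id: str              # stable id used to drop and reactivate
    namespace: str               # scope the package belongs to
    text: str
    structure: dict              # e.g. a parsed section tree
    provenance: dict             # e.g. source file and content hash
    policy_labels: frozenset = frozenset()
    created_at: float = field(default_factory=time.time)  # freshness
    token_estimate: int = 0      # token-budget metadata


class ContextRegistry:
    """Stores packages and tracks which ones are currently active."""

    def __init__(self):
        self._stored = {}
        self._active = set()

    def store(self, pkg: ContextPackage) -> None:
        self._stored[(pkg.namespace, pkg.package_id)] = pkg

    def activate(self, namespace: str, package_id: str) -> ContextPackage:
        key = (namespace, package_id)
        pkg = self._stored[key]  # raises KeyError for unknown ids
        self._active.add(key)
        return pkg

    def drop(self, namespace: str, package_id: str) -> None:
        # Dropping only deactivates; the stored package stays reactivatable.
        self._active.discard((namespace, package_id))

    def active(self, namespace: str) -> list[str]:
        return sorted(pid for ns, pid in self._active if ns == namespace)
```

Keying on the pair of namespace and id means an id from one namespace cannot activate a package stored in another.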

Provenance, Observability, And Debuggability

W3C PROV provides a vocabulary for entities, activities, agents, and derivations. OpenTelemetry traces provide spans and attributes for observing distributed or multi-step operations.

Relevance:

  • Cache entries should record what produced them.
  • Query results should be explainable: source file, section, content hash, index backend, policy decision, and transform chain.
  • Agent context packs should be auditable.
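
An explainable query result could carry those fields directly. The `Match` class and its `explain` format below are assumptions made for illustration; they are not tied to the W3C PROV vocabulary or to any OpenTelemetry API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    source_file: str
    section: str
    content_hash: str
    backend: str
    policy_decision: str
    transform_chain: tuple = ()

    def explain(self) -> str:
        """Render a one-line account of where this result came from."""
        chain = " -> ".join(self.transform_chain) or "none"
        return (
            f"{self.source_file}#{self.section} "
            f"[hash={self.content_hash[:12]} backend={self.backend} "
            f"policy={self.policy_decision} transforms={chain}]"
        )


match = Match(
    source_file="docs/cache.md",
    section="Purpose",
    content_hash="abcdef0123456789",  # placeholder value, not a real digest
    backend="sqlite-fts5",
    policy_decision="allow",
    transform_chain=("parse", "select"),
)
print(match.explain())
```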

Access Control: Fluid To Rigid

Zanzibar demonstrates a relationship-based authorization model at large scale. OpenFGA and SpiceDB make Zanzibar-style relationship-based access control available as productized systems. OPA/Rego and Cedar provide policy evaluation models for attribute-based and rule-based decisions.

Relevance:

  • Markitect should support a fluid-to-rigid access-control ladder.
  • Local labs can start with labels and trust scopes.
  • Secure deployments need policy checks before query results are returned to agents or users.
  • Policy decisions should be part of the diagnostic and provenance trail.
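
The fluid end of the ladder, labels and trust scopes, can be sketched in a few lines. The rule used here (a result is released only if the caller holds every label on it) is an assumption chosen for the example; `Decision`, `check_labels`, and `filter_results` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str


def check_labels(result_labels: frozenset, caller_scopes: frozenset) -> Decision:
    """Allow a result only if the caller's scopes cover all of its labels."""
    missing = result_labels - caller_scopes
    if missing:
        return Decision(False, "missing scopes: " + ", ".join(sorted(missing)))
    # Note: under this rule an unlabeled result is released to any caller.
    return Decision(True, "all labels covered")


def filter_results(results, caller_scopes: frozenset):
    """Return (released results, decision log), checking policy before release."""
    released, log = [], []
    for result in results:
        decision = check_labels(result["labels"], caller_scopes)
        log.append((result["id"], decision))  # kept for the diagnostic trail
        if decision.allowed:
            released.append(result)
    return released, log


results = [
    {"id": "s1", "labels": frozenset({"public"})},
    {"id": "s2", "labels": frozenset({"public", "finance"})},
]
released, log = filter_results(results, frozenset({"public"}))
```

A stricter deployment could swap `check_labels` for a call into a ReBAC or policy engine while keeping the same filter and log shape.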

Main Finding

The optional backend should be a capability-oriented cache fabric, not a single database choice.

The slim core should continue to parse, validate, query, transform, and generate Markdown without persistent infrastructure. The research-lab backend should attach through explicit interfaces:

  • content-addressed snapshots
  • index manifests
  • query adapter registry
  • memory/context package registry
  • access policy gateway
  • provenance and trace records

This lets the project support spontaneous one-off tool use while leaving room to grow into high-performance, agentic, security-sensitive knowledge systems.

Most Promising Use Cases

UC-RL-001: AST Introspection And JSONPath Backend

Expose raw parsed documents for advanced users:

  • mkt ast show
  • mkt ast query --backend jsonpath
  • raw token and inline query support
  • adapter path from simple selectors to JSONPath where possible

Utility:

  • debugging parser behavior
  • developing transforms
  • power-user structural extraction
  • migration path for legacy markitect-main AST workflows

UC-RL-002: Local Persistent Knowledge Index

Build a local cache/index for a repo or document collection:

  • content-addressed document snapshots
  • SQLite JSON tables for structure
  • SQLite FTS5 for section/block text search
  • optional DuckDB/Arrow export for analytical work
  • incremental refresh based on content hashes
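
Incremental refresh based on content hashes amounts to comparing current digests with the ones recorded at the last indexing run. The sketch below assumes SHA-256 and an in-memory mapping of path to stored digest; `plan_refresh` is a hypothetical helper name.

```python
import hashlib


def plan_refresh(current: dict[str, bytes], indexed: dict[str, str]):
    """Decide which documents to (re)index, skip, or remove.

    current: path -> file content as read now
    indexed: path -> hex digest recorded when the path was last indexed
    """
    to_index, unchanged = [], []
    for path, content in sorted(current.items()):
        digest = hashlib.sha256(content).hexdigest()
        if indexed.get(path) == digest:
            unchanged.append(path)
        else:
            to_index.append(path)  # new file or changed content
    # Paths that were indexed before but no longer exist.
    to_remove = sorted(set(indexed) - set(current))
    return to_index, unchanged, to_remove


previously_indexed = {
    "a.md": hashlib.sha256(b"# A\n").hexdigest(),
    "b.md": hashlib.sha256(b"# B\n").hexdigest(),
    "gone.md": hashlib.sha256(b"# Gone\n").hexdigest(),
}
files_now = {"a.md": b"# A\n", "b.md": b"# B edited\n", "c.md": b"# C\n"}
to_index, unchanged, to_remove = plan_refresh(files_now, previously_indexed)
```

Only `b.md` and `c.md` would be re-parsed in this example; `a.md` is skipped and `gone.md` is dropped from the index.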

Utility:

  • fast repeated queries
  • search across many Markdown files
  • offline/local-first knowledge work
  • foundation for batch transforms and generation pipelines

UC-RL-003: Agent Working Memory Cache

Create activatable context packages for LLM agents:

  • namespace-scoped memories
  • short-term working sets and long-term caches
  • semantic/episodic/procedural memory categories
  • drop/reactivate by stable id
  • token-budget-aware context assembly
  • provenance and freshness metadata
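
Token-budget-aware assembly can be approximated with a greedy pass over prioritized candidates. Both the greedy strategy and the four-characters-per-token estimate are simplifying assumptions for this sketch; a real implementation would likely use the target model's tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Crude heuristic (an assumption): about four characters per token."""
    return max(1, len(text) // 4)


def assemble_context(candidates, budget: int):
    """Greedily pick the highest-priority texts that still fit the budget.

    candidates: list of (priority, text) pairs; higher priority wins.
    Returns (chosen texts in priority order, tokens used).
    """
    chosen, used = [], 0
    for _priority, text in sorted(candidates, key=lambda c: c[0], reverse=True):
        cost = estimate_tokens(text)
        if used + cost <= budget:
            chosen.append(text)
            used += cost
    return chosen, used


candidates = [(3, "a" * 40), (2, "b" * 40), (1, "c" * 8)]
chosen, used = assemble_context(candidates, budget=12)
```

With a budget of 12, the first candidate (about 10 tokens) fits, the second is skipped, and the short third one fills the remaining space.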

Utility:

  • efficient agent work across long projects
  • reusable context packs for recurring tasks
  • controlled memory updates and recall
  • bridge from Markitect documents to agent infrastructure

UC-RL-004: Access-Controlled Knowledge Gateway

Add policy enforcement to cached retrieval:

  • labels/trust zones for local use
  • ACL/ReBAC/ABAC adapters for stricter systems
  • policy-aware query result filtering
  • decision logs and diagnostics
  • secure context packages for LLM use

Utility:

  • enterprise and IT-security use cases
  • multi-tenant knowledge bases
  • agent access control
  • auditable data exposure

Design Principles

  • The core remains infrastructure-free.
  • Backends are optional and capability-declared.
  • Every cached object is content-addressed or provenance-addressed.
  • Query adapters return the same match/result envelope.
  • Policy is checked before data leaves a backend boundary.
  • Context packages are explicit, droppable, and reactivatable.
  • LLM memory is data with provenance, not invisible prompt residue.
  • Experimental backends belong behind stable contracts.