# Research Lab: Sophisticated Cache Backends

Date: 2026-05-03

## Purpose

This research note explores how `markitect-tool` can keep its slim, markdown-native core while allowing sophisticated optional backends for cached ASTs, structured indexes, multiple query paradigms, agent working memory, and access-controlled knowledge systems.

The goal is not to rebuild `markitect-main` wholesale. The goal is to preserve the useful insight behind it: once Markdown has been parsed into a trustworthy structure, many higher-value operations become possible if that structure can be cached, indexed, queried, reactivated, and governed.

## Research Signals

### Content Addressing And Reproducibility

Git's object model is a practical reference for content-addressed storage: content is written to an object database and retrieved by a hash-derived key. Bazel remote caching similarly separates action outputs from metadata so work can be reused when inputs are unchanged.

Relevance:

- Parse results should be keyed by content hash, parser version, and options.
- Derived indexes should declare their input snapshots and invalidation rules.
- Reproducible context packages need stable object identities.

Sources:

- https://git-scm.com/book/en/v2/Git-Internals-Git-Objects
- https://docs.bazel.build/versions/main/remote-caching.html

### Structured Query And AST Introspection

JSONPath is now standardized as RFC 9535. It defines selection and extraction over JSON values and has security considerations around implementation behavior and query construction. This makes it a good optional backend for power users who need raw access to the full parsed structure.

SQLite JSON and FTS5 provide a pragmatic local storage/query foundation. FTS5 supports full-text search, relevance ranking, phrase/prefix/NEAR queries, and external-content tables. These features map well to Markdown sections and blocks while keeping local-first operation.

Relevance:

- Keep the current simple selector API as the common surface.
- Add JSONPath over `Document.to_dict()` as an optional advanced adapter.
- Add SQLite as the first local persistent index backend.
- Keep AST introspection as a debugging and research-lab capability, not as the default user interface.

Sources:

- https://www.rfc-editor.org/rfc/rfc9535.html
- https://www.sqlite.org/json1.html
- https://www.sqlite.org/fts5.html

### Columnar And Vector Backends

Apache Arrow defines a language-independent columnar memory format. DuckDB is strong for local analytical SQL over structured data. Vector databases such as Qdrant, LanceDB, and pgvector provide semantic retrieval primitives.

Relevance:

- The core should not depend on any vector database.
- Index backends should advertise capabilities: text search, SQL, JSONPath, vector search, hybrid retrieval, analytical scans.
- Vector indexes should store provenance back to document, section, and content hash, not just opaque chunks.

Sources:

- https://arrow.apache.org/docs/format/Columnar.html
- https://duckdb.org/docs/stable/data/json/overview
- https://qdrant.tech/documentation/manage-data/collections/
- https://docs.lancedb.com/
- https://github.com/pgvector/pgvector

### Agent Context And Working Memory

The Model Context Protocol gives a useful integration model: resources provide context/data, tools execute actions, and roots define filesystem or URI boundaries. LangChain/LangGraph memory docs distinguish short-term, thread-scoped memory from long-term, namespace-scoped memory, and further split long-term memory into semantic, episodic, and procedural forms. The MemGPT paper frames memory management as an operating-system-like problem for LLMs.

Relevance:

- Markitect context caches should be namespace-scoped and explicitly activatable.
- A context package should carry text, structure, provenance, policy, freshness, and token-budget metadata.
- Agents should be able to drop and reactivate working context by stable id.
- Memory writes need hot-path and background modes.
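
The agent-context points above (namespace scoping, drop and reactivate by stable id, token-budget metadata) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ContextPackage`, `ContextRegistry`, and every field and method name here are hypothetical and not part of `markitect-tool`; the stable id is assumed to be a hash of namespace plus text.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ContextPackage:
    """Hypothetical context package; all field names are illustrative."""
    namespace: str
    text: str
    structure: dict
    provenance: dict
    policy_labels: tuple = ()
    token_estimate: int = 0
    # Freshness metadata: when this package was assembled.
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    @property
    def package_id(self) -> str:
        # Assumption: the stable id depends only on namespace and text, so
        # identical content in one namespace always yields the same id.
        raw = f"{self.namespace}\n{self.text}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()[:16]


class ContextRegistry:
    """Hypothetical registry separating stored packages from active ones."""

    def __init__(self):
        self._stored = {}     # package id -> ContextPackage
        self._active = set()  # ids currently in the agent's working context

    def put(self, package: ContextPackage) -> str:
        pid = package.package_id
        self._stored[pid] = package
        return pid

    def activate(self, pid: str) -> None:
        if pid not in self._stored:
            raise KeyError(pid)
        self._active.add(pid)

    def drop(self, pid: str) -> None:
        # Dropping removes the package from working context only;
        # it stays stored and can be reactivated later.
        self._active.discard(pid)

    def assemble(self, namespace: str, token_budget: int) -> list:
        """Return active packages in a namespace that fit the budget."""
        chosen, used = [], 0
        for pid in sorted(self._active):
            pkg = self._stored[pid]
            if pkg.namespace != namespace:
                continue
            if used + pkg.token_estimate > token_budget:
                continue
            chosen.append(pkg)
            used += pkg.token_estimate
        return chosen
```

Keeping "stored" and "active" as separate sets is the design choice that makes drop and reactivate cheap: dropping a package never discards its provenance or policy metadata.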

Sources:

- https://modelcontextprotocol.io/specification/2025-06-18
- https://docs.langchain.com/oss/python/concepts/memory
- https://developers.llamaindex.ai/python/framework/module_guides/deploying/agents/memory/
- https://arxiv.org/abs/2310.08560

### Provenance, Observability, And Debuggability

W3C PROV provides a vocabulary for entities, activities, agents, and derivations. OpenTelemetry traces provide spans and attributes for observing distributed or multi-step operations.

Relevance:

- Cache entries should record what produced them.
- Query results should be explainable: source file, section, content hash, index backend, policy decision, and transform chain.
- Agent context packs should be auditable.

Sources:

- https://www.w3.org/TR/prov-overview/
- https://opentelemetry.io/docs/concepts/signals/traces/

### Access Control: Fluid To Rigid

Zanzibar demonstrates a relationship-based authorization model at large scale. OpenFGA and SpiceDB make Zanzibar-style relationship-based access control available as productized systems. OPA/Rego and Cedar provide policy evaluation models for attribute and rule-based decisions.

Relevance:

- Markitect should support a fluid-to-rigid access-control ladder.
- Local labs can start with labels and trust scopes.
- Secure deployments need policy checks before query results are returned to agents or users.
- Policy decisions should be part of the diagnostic and provenance trail.

Sources:

- https://www.usenix.org/conference/atc19/presentation/pang
- https://openfga.dev/docs/concepts
- https://www.openpolicyagent.org/docs/policy-language
- https://docs.cedarpolicy.com/

## Main Finding

The optional backend should be a **capability-oriented cache fabric**, not a single database choice.

The slim core should continue to parse, validate, query, transform, and generate Markdown without persistent infrastructure.
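
As a rough illustration of what "capability-oriented" could mean in practice, a backend might declare the capabilities it supports and callers might select backends by capability rather than by product name. The sketch below is an assumption, not an existing `markitect-tool` API: `Capability`, `Backend`, `StaticBackend`, and `BackendRegistry` are hypothetical names, and the capability list is taken from the Columnar And Vector Backends notes.

```python
from dataclasses import dataclass
from enum import Flag, auto
from typing import Protocol


class Capability(Flag):
    """Capabilities a backend may advertise (names are illustrative)."""
    TEXT_SEARCH = auto()
    SQL = auto()
    JSONPATH = auto()
    VECTOR_SEARCH = auto()
    HYBRID_RETRIEVAL = auto()
    ANALYTICAL_SCAN = auto()


class Backend(Protocol):
    """Minimal shape a backend must expose to be registered."""
    name: str
    capabilities: Capability


@dataclass
class StaticBackend:
    """Hypothetical backend stub that only declares what it supports."""
    name: str
    capabilities: Capability


class BackendRegistry:
    def __init__(self):
        self._backends = []

    def register(self, backend: Backend) -> None:
        self._backends.append(backend)

    def find(self, required: Capability) -> list:
        """Return backends that advertise every required capability."""
        return [
            b for b in self._backends
            if (b.capabilities & required) == required
        ]


# Example wiring; the capability assignments are assumptions for illustration.
registry = BackendRegistry()
registry.register(
    StaticBackend("sqlite", Capability.TEXT_SEARCH | Capability.SQL)
)
registry.register(StaticBackend("jsonpath", Capability.JSONPATH))

text_backends = registry.find(Capability.TEXT_SEARCH)
```

With this shape, the slim core never imports a backend directly; it asks the registry for a capability and degrades gracefully when the list comes back empty.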

The research-lab backend should attach through explicit interfaces:

- content-addressed snapshots
- index manifests
- query adapter registry
- memory/context package registry
- access policy gateway
- provenance and trace records

That lets the project support spontaneous one-time tool use and also grow into high-performance, agentic, security-sensitive knowledge systems.

## Most Promising Use Cases

### UC-RL-001: AST Introspection And JSONPath Backend

Expose raw parsed documents for advanced users:

- `mkt ast show`
- `mkt ast query --backend jsonpath`
- raw token and inline query support
- adapter path from simple selectors to JSONPath where possible

Utility:

- debugging parser behavior
- developing transforms
- power-user structural extraction
- migration path for legacy `markitect-main` AST workflows

### UC-RL-002: Local Persistent Knowledge Index

Build a local cache/index for a repo or document collection:

- content-addressed document snapshots
- SQLite JSON tables for structure
- SQLite FTS5 for section/block text search
- optional DuckDB/Arrow export for analytical work
- incremental refresh based on content hashes

Utility:

- fast repeated queries
- search across many Markdown files
- offline/local-first knowledge work
- foundation for batch transforms and generation pipelines

### UC-RL-003: Agent Working Memory Cache

Create activatable context packages for LLM agents:

- namespace-scoped memories
- short-term working sets and long-term caches
- semantic/episodic/procedural memory categories
- drop/reactivate by stable id
- token-budget-aware context assembly
- provenance and freshness metadata

Utility:

- efficient agent work across long projects
- reusable context packs for recurring tasks
- controlled memory updates and recall
- bridge from Markitect documents to agent infrastructure

### UC-RL-004: Access-Controlled Knowledge Gateway

Add policy enforcement to cached retrieval:

- labels/trust zones for local use
- ACL/ReBAC/ABAC adapters for stricter systems
- policy-aware query result filtering
- decision logs and diagnostics
- secure context packages for LLM use

Utility:

- enterprise and IT-security use cases
- multi-tenant knowledge bases
- agent access control
- auditable data exposure

## Design Principles

- The core remains infrastructure-free.
- Backends are optional and capability-declared.
- Every cached object is content-addressed or provenance-addressed.
- Query adapters return the same match/result envelope.
- Policy is checked before data leaves a backend boundary.
- Context packages are explicit, droppable, and reactivatable.
- LLM memory is data with provenance, not invisible prompt residue.
- Experimental backends belong behind stable contracts.
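
The content-addressing principle, together with the earlier note that parse results should be keyed by content hash, parser version, and options, could look roughly like the following. This is a minimal sketch under assumptions: `parse_cache_key` and `ParseCache` are hypothetical names, SHA-256 and canonical JSON are choices made here for illustration, and the `parse` callable stands in for whatever parser the core provides.

```python
import hashlib
import json


def parse_cache_key(content: bytes, parser_version: str, options: dict) -> str:
    """Derive a cache key from content hash, parser version, and options."""
    content_hash = hashlib.sha256(content).hexdigest()
    # Canonical JSON (sorted keys, fixed separators) so that the order in
    # which options were supplied does not change the resulting key.
    canonical = json.dumps(
        {
            "content": content_hash,
            "parser": parser_version,
            "options": options,
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ParseCache:
    """Hypothetical in-memory cache keyed by parse_cache_key."""

    def __init__(self):
        self._entries = {}

    def get_or_parse(self, content, parser_version, options, parse):
        """Return (key, result), calling parse(content) only on a miss."""
        key = parse_cache_key(content, parser_version, options)
        if key not in self._entries:
            self._entries[key] = parse(content)
        return key, self._entries[key]
```

Because the parser version and options are part of the key, upgrading the parser or changing an option invalidates old entries without any explicit eviction logic; unchanged files keep hitting the cache.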