Research Lab: Sophisticated Cache Backends
Date: 2026-05-03
Purpose
This research note explores how markitect-tool can keep its slim,
markdown-native core while allowing sophisticated optional backends for cached
ASTs, structured indexes, multiple query paradigms, agent working memory, and
access-controlled knowledge systems.
The goal is not to rebuild markitect-main wholesale. The goal is to preserve
the useful insight behind it: once Markdown has been parsed into a trustworthy
structure, many higher-value operations become possible if that structure can
be cached, indexed, queried, reactivated, and governed.
Research Signals
Content Addressing And Reproducibility
Git's object model is a practical reference for content-addressed storage: content is written to an object database and retrieved by a hash-derived key. Bazel's remote cache similarly separates an action cache (a map from action hashes to action result metadata) from a content-addressable store of output files, so work can be reused when inputs are unchanged.
Relevance:
- Parse results should be keyed by content hash, parser version, and options.
- Derived indexes should declare their input snapshots and invalidation rules.
- Reproducible context packages need stable object identities.
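The first bullet above can be sketched as a small key-derivation function. This is a minimal illustration, not markitect-tool's actual API: the function name `parse_cache_key`, the choice of SHA-256, and the option names are all assumptions made for the example.

```python
import hashlib
import json


def parse_cache_key(content: bytes, parser_version: str, options: dict) -> str:
    """Derive a stable cache key from content hash, parser version, and options.

    Hypothetical helper: name, hash choice, and payload layout are assumptions.
    """
    content_hash = hashlib.sha256(content).hexdigest()
    # Canonical JSON (sorted keys, fixed separators) so that the order in which
    # options were supplied never changes the resulting key.
    payload = json.dumps(
        {"content": content_hash, "parser": parser_version, "options": options},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Same content and options in a different order: same key.
key_a = parse_cache_key(b"# Title\n", "1.0.0", {"gfm": True, "front_matter": False})
key_b = parse_cache_key(b"# Title\n", "1.0.0", {"front_matter": False, "gfm": True})
# A parser upgrade invalidates the cached parse result.
key_c = parse_cache_key(b"# Title\n", "1.1.0", {"gfm": True, "front_matter": False})
```

Folding the parser version and options into the key means invalidation needs no extra bookkeeping: a changed input produces a different key and the old entry is never looked up again.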
Sources:
- https://git-scm.com/book/en/v2/Git-Internals-Git-Objects
- https://docs.bazel.build/versions/main/remote-caching.html
Structured Query And AST Introspection
JSONPath is now standardized as RFC 9535. It defines selection and extraction over JSON values and has security considerations around implementation behavior and query construction. This makes it a good optional backend for power users who need raw access to the full parsed structure.
SQLite JSON and FTS5 provide a pragmatic local storage/query foundation. FTS5 supports full-text search, relevance ranking, phrase/prefix/NEAR queries, and external-content tables. These features map well to Markdown sections and blocks while keeping local-first operation.
Relevance:
- Keep the current simple selector API as the common surface.
- Add JSONPath over Document.to_dict() as an optional advanced adapter.
- Add SQLite as the first local persistent index backend.
- Keep AST introspection as a debugging and research-lab capability, not as the default user interface.
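To make the SQLite suggestion concrete, here is a minimal sketch of a section-level FTS5 index using Python's standard `sqlite3` module. The table layout, column names, and sample rows are assumptions for illustration, and it presumes the local SQLite build has FTS5 compiled in, which is common but not guaranteed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per Markdown section. doc_path and content_hash are stored for
# provenance but marked UNINDEXED so they are not tokenized or searched.
conn.execute(
    "CREATE VIRTUAL TABLE sections USING fts5("
    "doc_path UNINDEXED, heading, body, content_hash UNINDEXED)"
)
conn.executemany(
    "INSERT INTO sections (doc_path, heading, body, content_hash) VALUES (?, ?, ?, ?)",
    [
        ("docs/cache.md", "Invalidation", "Derived indexes declare invalidation rules.", "h1"),
        ("docs/cache.md", "Storage", "Snapshots are stored by content hash.", "h2"),
        ("docs/agents.md", "Memory", "Agents reactivate context by stable id.", "h3"),
    ],
)


def search_sections(query: str) -> list[tuple[str, str]]:
    """Return (doc_path, heading) pairs, best match first.

    Ordering by the FTS5 `rank` column uses bm25 relevance by default.
    """
    rows = conn.execute(
        "SELECT doc_path, heading FROM sections WHERE sections MATCH ? ORDER BY rank",
        (query,),
    )
    return rows.fetchall()


hits = search_sections("invalid*")          # prefix query
phrase_hits = search_sections('"content hash"')  # phrase query
```

With the sample rows, the prefix query returns only the "Invalidation" section and the phrase query returns only the "Storage" section. Every hit carries its document path, so results can be traced back to their source.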
Sources:
- https://www.rfc-editor.org/rfc/rfc9535.html
- https://www.sqlite.org/json1.html
- https://www.sqlite.org/fts5.html
Columnar And Vector Backends
Apache Arrow defines a language-independent columnar memory format. DuckDB is strong for local analytical SQL over structured data. Vector databases such as Qdrant, LanceDB, and pgvector provide semantic retrieval primitives.
Relevance:
- The core should not depend on any vector database.
- Index backends should advertise capabilities: text search, SQL, JSONPath, vector search, hybrid retrieval, analytical scans.
- Vector indexes should store provenance back to document, section, and content hash, not just opaque chunks.
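Capability advertisement could look roughly like the following sketch. The names `Capability`, `BackendInfo`, `ChunkProvenance`, and `backends_with` are hypothetical, and the registry entries are illustrative rather than a statement of what any particular backend will support in markitect-tool.

```python
from dataclasses import dataclass
from enum import Flag, auto


class Capability(Flag):
    """Capabilities a backend may advertise (list taken from the note above)."""
    TEXT_SEARCH = auto()
    SQL = auto()
    JSONPATH = auto()
    VECTOR_SEARCH = auto()
    HYBRID_RETRIEVAL = auto()
    ANALYTICAL_SCAN = auto()


@dataclass(frozen=True)
class BackendInfo:
    name: str
    capabilities: Capability


@dataclass(frozen=True)
class ChunkProvenance:
    """What a vector entry should carry alongside its embedding."""
    doc_path: str
    section_id: str
    content_hash: str


# Illustrative registry; "vector-lab" is a made-up backend name.
REGISTRY = [
    BackendInfo("sqlite", Capability.TEXT_SEARCH | Capability.SQL),
    BackendInfo("duckdb", Capability.SQL | Capability.ANALYTICAL_SCAN),
    BackendInfo("vector-lab", Capability.VECTOR_SEARCH | Capability.HYBRID_RETRIEVAL),
]


def backends_with(required: Capability) -> list[str]:
    """Names of backends that advertise every required capability."""
    return [b.name for b in REGISTRY if (required & b.capabilities) == required]


sql_backends = backends_with(Capability.SQL)
sql_and_text = backends_with(Capability.SQL | Capability.TEXT_SEARCH)
chunk = ChunkProvenance("docs/cache.md", "invalidation", "h1")
```

A caller asks for capabilities rather than for a named database, which keeps the core free of any hard dependency on a specific backend.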
Sources:
- https://arrow.apache.org/docs/format/Columnar.html
- https://duckdb.org/docs/stable/data/json/overview
- https://qdrant.tech/documentation/manage-data/collections/
- https://docs.lancedb.com/
- https://github.com/pgvector/pgvector
Agent Context And Working Memory
The Model Context Protocol gives a useful integration model: resources provide context/data, tools execute actions, and roots define filesystem or URI boundaries. LangChain/LangGraph memory docs distinguish short-term, thread-scoped memory from long-term, namespace-scoped memory, and further split long-term memory into semantic, episodic, and procedural forms. The MemGPT paper frames memory management as an operating-system-like problem for LLMs.
Relevance:
- Markitect context caches should be namespace-scoped and explicitly activatable.
- A context package should carry text, structure, provenance, policy, freshness, and token-budget metadata.
- Agents should be able to drop and reactivate working context by stable id.
- Memory writes need hot-path and background modes.
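One way to read "drop and reactivate by stable id" is as a registry in which dropping deactivates a package without deleting it. The sketch below assumes that reading; `ContextPackage`, `ContextRegistry`, and their fields are hypothetical names, and policy and freshness metadata are omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class ContextPackage:
    package_id: str  # stable id that survives drop/reactivate
    namespace: str
    text: str
    provenance: list[str] = field(default_factory=list)  # e.g. source content hashes
    token_estimate: int = 0


class ContextRegistry:
    """Keeps every package stored; only active ones are handed to an agent."""

    def __init__(self) -> None:
        self._stored: dict[str, ContextPackage] = {}
        self._active: set[str] = set()

    def put(self, package: ContextPackage) -> None:
        self._stored[package.package_id] = package

    def activate(self, package_id: str) -> ContextPackage:
        package = self._stored[package_id]  # KeyError if the id was never stored
        self._active.add(package_id)
        return package

    def drop(self, package_id: str) -> None:
        # Dropping only deactivates; the stored package stays reactivatable.
        self._active.discard(package_id)

    def active(self, namespace: str) -> list[ContextPackage]:
        return [
            self._stored[pid]
            for pid in sorted(self._active)
            if self._stored[pid].namespace == namespace
        ]


registry = ContextRegistry()
registry.put(ContextPackage("ctx-001", "project-a", "Cache design notes", ["hash-1"], 120))
registry.activate("ctx-001")
registry.drop("ctx-001")
after_drop = registry.active("project-a")
registry.activate("ctx-001")
after_reactivate = [p.package_id for p in registry.active("project-a")]
```

After the drop the active set for the namespace is empty, and reactivating by the same id restores the package unchanged.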
Sources:
- https://modelcontextprotocol.io/specification/2025-06-18
- https://docs.langchain.com/oss/python/concepts/memory
- https://developers.llamaindex.ai/python/framework/module_guides/deploying/agents/memory/
- https://arxiv.org/abs/2310.08560
Provenance, Observability, And Debuggability
W3C PROV provides a vocabulary for entities, activities, agents, and derivations. OpenTelemetry traces provide spans and attributes for observing distributed or multi-step operations.
Relevance:
- Cache entries should record what produced them.
- Query results should be explainable: source file, section, content hash, index backend, policy decision, and transform chain.
- Agent context packs should be auditable.
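The explainability fields listed above map naturally onto a result record. The following is a sketch under assumed names (`QueryResult`, `explain`); the field set mirrors the bullet, but the string format is an arbitrary choice for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryResult:
    """A query hit that carries enough metadata to explain itself."""
    text: str
    source_file: str
    section: str
    content_hash: str
    index_backend: str
    policy_decision: str
    transform_chain: tuple[str, ...] = ()

    def explain(self) -> str:
        """One-line, human-readable account of where this result came from."""
        chain = " -> ".join(self.transform_chain) or "none"
        return (
            f"{self.source_file}#{self.section} "
            f"(hash={self.content_hash}, backend={self.index_backend}, "
            f"policy={self.policy_decision}, transforms={chain})"
        )


result = QueryResult(
    text="Derived indexes declare invalidation rules.",
    source_file="docs/cache.md",
    section="Invalidation",
    content_hash="ab12",
    index_backend="sqlite-fts5",
    policy_decision="allow",
    transform_chain=("parse", "section-split"),
)
explanation = result.explain()
```

Keeping these fields on the result itself, rather than in a separate log, means an agent context pack built from such results is auditable without consulting the backend again.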
Sources:
Access Control: Fluid To Rigid
Zanzibar demonstrates a relationship-based authorization model at large scale. OpenFGA and SpiceDB make Zanzibar-style relationship-based access control available as productized systems. OPA/Rego and Cedar provide policy evaluation models for attribute and rule-based decisions.
Relevance:
- Markitect should support a fluid-to-rigid access-control ladder.
- Local labs can start with labels and trust scopes.
- Secure deployments need policy checks before query results are returned to agents or users.
- Policy decisions should be part of the diagnostic and provenance trail.
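The lowest rung of the ladder, labels checked before results leave the backend, can be sketched as follows. `Result`, `Decision`, and `filter_results` are hypothetical names, and the label vocabulary is an assumption; a stricter deployment would replace the membership test with a call to a policy engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    source: str
    text: str
    label: str  # assumed vocabulary, e.g. "public", "internal", "restricted"


@dataclass(frozen=True)
class Decision:
    source: str
    allowed: bool
    reason: str


def filter_results(
    results: list[Result], allowed_labels: set[str]
) -> tuple[list[Result], list[Decision]]:
    """Apply a label check to each result and record every decision."""
    released: list[Result] = []
    decisions: list[Decision] = []
    for result in results:
        allowed = result.label in allowed_labels
        reason = f"label '{result.label}' " + ("permitted" if allowed else "not permitted")
        decisions.append(Decision(result.source, allowed, reason))
        if allowed:
            released.append(result)
    return released, decisions


candidates = [
    Result("docs/intro.md", "Welcome", "public"),
    Result("docs/keys.md", "Rotation schedule", "restricted"),
]
released, decisions = filter_results(candidates, {"public", "internal"})
```

The caller receives only the permitted results, while the decision list records both the allow and the deny so it can feed the diagnostic and provenance trail.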
Sources:
- https://www.usenix.org/conference/atc19/presentation/pang
- https://openfga.dev/docs/concepts
- https://www.openpolicyagent.org/docs/policy-language
- https://docs.cedarpolicy.com/
Main Finding
The optional backend should be a capability-oriented cache fabric, not a single database choice.
The slim core should continue to parse, validate, query, transform, and generate Markdown without persistent infrastructure. The research-lab backend should attach through explicit interfaces:
- content-addressed snapshots
- index manifests
- query adapter registry
- memory/context package registry
- access policy gateway
- provenance and trace records
That lets the project support spontaneous one-time tool use and also grow into high-performance, agentic, security-sensitive knowledge systems.
Most Promising Use Cases
UC-RL-001: AST Introspection And JSONPath Backend
Expose raw parsed documents for advanced users:
- mkt ast show
- mkt ast query --backend jsonpath
- raw token and inline query support
- adapter path from simple selectors to JSONPath where possible
Utility:
- debugging parser behavior
- developing transforms
- power-user structural extraction
- migration path for legacy markitect-main AST workflows
UC-RL-002: Local Persistent Knowledge Index
Build a local cache/index for a repo or document collection:
- content-addressed document snapshots
- SQLite JSON tables for structure
- SQLite FTS5 for section/block text search
- optional DuckDB/Arrow export for analytical work
- incremental refresh based on content hashes
Utility:
- fast repeated queries
- search across many Markdown files
- offline/local-first knowledge work
- foundation for batch transforms and generation pipelines
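Incremental refresh based on content hashes reduces to comparing a stored manifest against the current state of the collection. The sketch below assumes a manifest shaped as a mapping from path to SHA-256 hex digest; `plan_refresh` and the manifest format are illustrative, not an existing markitect-tool interface.

```python
import hashlib


def content_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def plan_refresh(
    manifest: dict[str, str], current: dict[str, bytes]
) -> dict[str, list[str]]:
    """Compare stored hashes with current content and classify each path."""
    plan: dict[str, list[str]] = {"added": [], "changed": [], "unchanged": [], "removed": []}
    for path, data in sorted(current.items()):
        if path not in manifest:
            plan["added"].append(path)
        elif manifest[path] != content_hash(data):
            plan["changed"].append(path)
        else:
            plan["unchanged"].append(path)
    # Paths that were indexed before but no longer exist in the collection.
    plan["removed"] = sorted(set(manifest) - set(current))
    return plan


manifest = {
    "a.md": content_hash(b"# A\n"),
    "b.md": content_hash(b"# B\n"),
    "c.md": content_hash(b"# C\n"),
}
current = {"a.md": b"# A\n", "b.md": b"# B edited\n", "d.md": b"# D\n"}
plan = plan_refresh(manifest, current)
```

Only the paths under "added" and "changed" need to be re-parsed and re-indexed; "removed" paths are deleted from the index and "unchanged" paths are skipped entirely.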
UC-RL-003: Agent Working Memory Cache
Create activatable context packages for LLM agents:
- namespace-scoped memories
- short-term working sets and long-term caches
- semantic/episodic/procedural memory categories
- drop/reactivate by stable id
- token-budget-aware context assembly
- provenance and freshness metadata
Utility:
- efficient agent work across long projects
- reusable context packs for recurring tasks
- controlled memory updates and recall
- bridge from Markitect documents to agent infrastructure
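Token-budget-aware assembly can be approximated with a greedy selection over prioritized packages. This is one simple strategy among several; the function name `assemble_context`, the tuple layout, and the priority scores are assumptions for the example.

```python
def assemble_context(candidates: list[tuple[str, int, float]], budget: int) -> list[str]:
    """Greedily pick package ids by descending priority while they fit the budget.

    Each candidate is (package_id, token_estimate, priority).
    """
    chosen: list[str] = []
    used = 0
    for package_id, tokens, _priority in sorted(
        candidates, key=lambda c: c[2], reverse=True
    ):
        # Skip packages that would overflow; smaller ones later may still fit.
        if used + tokens <= budget:
            chosen.append(package_id)
            used += tokens
    return chosen


candidates = [
    ("style-guide", 400, 0.5),
    ("open-issues", 900, 0.9),
    ("design-notes", 700, 0.8),
]
selected = assemble_context(candidates, budget=1400)
```

With a budget of 1400 tokens, the highest-priority package (900 tokens) is taken first, the 700-token package no longer fits, and the 400-token package fills the remainder. Greedy selection is not optimal in general, but it is predictable and easy to explain in a diagnostic trail.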
UC-RL-004: Access-Controlled Knowledge Gateway
Add policy enforcement to cached retrieval:
- labels/trust zones for local use
- ACL/ReBAC/ABAC adapters for stricter systems
- policy-aware query result filtering
- decision logs and diagnostics
- secure context packages for LLM use
Utility:
- enterprise and IT-security use cases
- multi-tenant knowledge bases
- agent access control
- auditable data exposure
Design Principles
- The core remains infrastructure-free.
- Backends are optional and capability-declared.
- Every cached object is content-addressed or provenance-addressed.
- Query adapters return the same match/result envelope.
- Policy is checked before data leaves a backend boundary.
- Context packages are explicit, droppable, and reactivatable.
- LLM memory is data with provenance, not invisible prompt residue.
- Experimental backends belong behind stable contracts.