# Research Lab: Sophisticated Cache Backends

Date: 2026-05-03

## Purpose

This research note explores how `markitect-tool` can keep its slim,
markdown-native core while allowing sophisticated optional backends for cached
ASTs, structured indexes, multiple query paradigms, agent working memory, and
access-controlled knowledge systems.

The goal is not to rebuild `markitect-main` wholesale. The goal is to preserve
the useful insight behind it: once Markdown has been parsed into a trustworthy
structure, many higher-value operations become possible if that structure can
be cached, indexed, queried, reactivated, and governed.

## Research Signals

### Content Addressing And Reproducibility

Git's object model is a practical reference for content-addressed storage:
content is written to an object database and retrieved by a hash-derived key.
Bazel's remote cache similarly separates action-result metadata (the action
cache) from output files (a content-addressable store), so work can be reused
when inputs are unchanged.

Relevance:

- Parse results should be keyed by content hash, parser version, and options.
- Derived indexes should declare their input snapshots and invalidation rules.
- Reproducible context packages need stable object identities.

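The keying rule in the first bullet can be sketched as follows. This is a minimal illustration, not `markitect-tool` code: the function name `parse_cache_key`, the version strings, and the option names are all assumptions. It shows one way to combine content hash, parser version, and options into a key that stays stable when option order changes.

```python
import hashlib
import json

PARSER_VERSION = "0.1.0"  # hypothetical version string, for illustration only


def parse_cache_key(source: str, parser_version: str, options: dict) -> str:
    """Derive a stable key from content hash, parser version, and options."""
    content_hash = hashlib.sha256(source.encode("utf-8")).hexdigest()
    # Canonical JSON so that the ordering of options does not change the key.
    canonical_options = json.dumps(options, sort_keys=True, separators=(",", ":"))
    material = "\n".join([content_hash, parser_version, canonical_options])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


# Same content and options in a different order: same key.
key_a = parse_cache_key("# Title\n", PARSER_VERSION, {"gfm": True, "front_matter": False})
key_b = parse_cache_key("# Title\n", PARSER_VERSION, {"front_matter": False, "gfm": True})
# A different parser version invalidates the cached parse result.
key_c = parse_cache_key("# Title\n", "0.2.0", {"gfm": True, "front_matter": False})
```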
Sources:

- https://git-scm.com/book/en/v2/Git-Internals-Git-Objects
- https://docs.bazel.build/versions/main/remote-caching.html

### Structured Query And AST Introspection

JSONPath is now standardized as RFC 9535. It defines selection and extraction
over JSON values and has security considerations around implementation behavior
and query construction. This makes it a good optional backend for power users
who need raw access to the full parsed structure.

SQLite JSON and FTS5 provide a pragmatic local storage/query foundation. FTS5
supports full-text search, relevance ranking, phrase/prefix/NEAR queries, and
external-content tables. These features map well to Markdown sections and
blocks while keeping local-first operation.

Relevance:

- Keep the current simple selector API as the common surface.
- Add JSONPath over `Document.to_dict()` as an optional advanced adapter.
- Add SQLite as the first local persistent index backend.
- Keep AST introspection as a debugging and research-lab capability, not as
  the default user interface.

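A rough sketch of how Markdown sections might map onto an FTS5 table, using Python's standard `sqlite3` module. The table name `sections`, its columns, and the sample rows are assumptions made for illustration, and the example presumes the linked SQLite build has FTS5 enabled, which is common but not guaranteed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one FTS5 row per Markdown section.
# `path` is UNINDEXED so file names are stored but not searched.
conn.execute(
    "CREATE VIRTUAL TABLE sections USING fts5(path UNINDEXED, heading, body)"
)
conn.executemany(
    "INSERT INTO sections (path, heading, body) VALUES (?, ?, ?)",
    [
        ("notes/cache.md", "Purpose", "Cached ASTs and structured indexes."),
        ("notes/cache.md", "Design Principles", "Backends are optional."),
        ("notes/memory.md", "Working Memory", "Agents reactivate cached context."),
    ],
)


def search_sections(query: str) -> list[tuple[str, str]]:
    """Return (path, heading) pairs for an FTS5 query, best match first."""
    rows = conn.execute(
        "SELECT path, heading FROM sections WHERE sections MATCH ? ORDER BY rank",
        (query,),
    )
    return rows.fetchall()


prefix_hits = search_sections("cach*")                 # prefix query
phrase_hits = search_sections('"structured indexes"')  # phrase query
```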
Sources:

- https://www.rfc-editor.org/rfc/rfc9535.html
- https://www.sqlite.org/json1.html
- https://www.sqlite.org/fts5.html

### Columnar And Vector Backends

Apache Arrow defines a language-independent columnar memory format. DuckDB is
strong for local analytical SQL over structured data. Vector databases such as
Qdrant, LanceDB, and pgvector provide semantic retrieval primitives.

Relevance:

- The core should not depend on any vector database.
- Index backends should advertise capabilities: text search, SQL, JSONPath,
  vector search, hybrid retrieval, analytical scans.
- Vector indexes should store provenance back to document, section, and content
  hash, not just opaque chunks.

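One way to let backends advertise capabilities is a flag set plus a small registry, sketched below. The names `Capability`, `BackendInfo`, and `REGISTRY`, and the capabilities assigned to each backend, are illustrative assumptions rather than an existing Markitect interface.

```python
from dataclasses import dataclass
from enum import Flag, auto


class Capability(Flag):
    """Capabilities named in the list above, as combinable flags."""
    TEXT_SEARCH = auto()
    SQL = auto()
    JSONPATH = auto()
    VECTOR_SEARCH = auto()
    HYBRID_RETRIEVAL = auto()
    ANALYTICAL_SCAN = auto()


@dataclass(frozen=True)
class BackendInfo:
    name: str
    capabilities: Capability


# Hypothetical registry contents; no vector backend is registered here,
# which mirrors a core that does not depend on any vector database.
REGISTRY = [
    BackendInfo("sqlite", Capability.TEXT_SEARCH | Capability.SQL),
    BackendInfo("jsonpath", Capability.JSONPATH),
    BackendInfo("duckdb", Capability.SQL | Capability.ANALYTICAL_SCAN),
]


def backends_with(required: Capability) -> list[str]:
    """Names of registered backends that offer every required capability."""
    return [b.name for b in REGISTRY if (b.capabilities & required) == required]


sql_backends = backends_with(Capability.SQL)
vector_backends = backends_with(Capability.VECTOR_SEARCH)
```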
Sources:

- https://arrow.apache.org/docs/format/Columnar.html
- https://duckdb.org/docs/stable/data/json/overview
- https://qdrant.tech/documentation/manage-data/collections/
- https://docs.lancedb.com/
- https://github.com/pgvector/pgvector

### Agent Context And Working Memory

The Model Context Protocol gives a useful integration model: resources provide
context/data, tools execute actions, and roots define filesystem or URI
boundaries. LangChain/LangGraph memory docs distinguish short-term,
thread-scoped memory from long-term, namespace-scoped memory, and further split
long-term memory into semantic, episodic, and procedural forms. The MemGPT
paper frames memory management as an operating-system-like problem for LLMs.

Relevance:

- Markitect context caches should be namespace-scoped and explicitly
  activatable.
- A context package should carry text, structure, provenance, policy, freshness,
  and token-budget metadata.
- Agents should be able to drop and reactivate working context by stable id.
- Memory writes need hot-path and background modes.

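The package shape and the drop/reactivate behavior described above could look roughly like this. Every name here (`ContextPackage`, `ContextRegistry`, the field names, the id `pkg-001`) is a hypothetical choice for the sketch; the note itself does not define these types.

```python
from dataclasses import dataclass, field


@dataclass
class ContextPackage:
    """Hypothetical package shape; field names are illustrative only."""
    package_id: str                                       # stable id
    namespace: str
    text: str
    structure: dict = field(default_factory=dict)         # e.g. heading outline
    provenance: list[str] = field(default_factory=list)   # source files/sections
    policy_labels: set[str] = field(default_factory=set)
    content_hash: str = ""                                # freshness check input
    token_estimate: int = 0


class ContextRegistry:
    """Keeps packages stored even while they are dropped from the active set."""

    def __init__(self) -> None:
        self._stored: dict[str, ContextPackage] = {}
        self._active: set[str] = set()

    def put(self, package: ContextPackage) -> None:
        self._stored[package.package_id] = package

    def activate(self, package_id: str) -> ContextPackage:
        package = self._stored[package_id]  # raises KeyError for unknown ids
        self._active.add(package_id)
        return package

    def drop(self, package_id: str) -> None:
        # Dropping only deactivates; the package stays stored for reactivation.
        self._active.discard(package_id)

    def active_ids(self, namespace: str) -> list[str]:
        return sorted(
            pid for pid in self._active
            if self._stored[pid].namespace == namespace
        )


registry = ContextRegistry()
registry.put(ContextPackage("pkg-001", "project-a", "Cache design notes", token_estimate=5))
registry.activate("pkg-001")
registry.drop("pkg-001")
after_drop = registry.active_ids("project-a")
registry.activate("pkg-001")
after_reactivate = registry.active_ids("project-a")
```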
Sources:

- https://modelcontextprotocol.io/specification/2025-06-18
- https://docs.langchain.com/oss/python/concepts/memory
- https://developers.llamaindex.ai/python/framework/module_guides/deploying/agents/memory/
- https://arxiv.org/abs/2310.08560

### Provenance, Observability, And Debuggability

W3C PROV provides a vocabulary for entities, activities, agents, and
derivations. OpenTelemetry traces provide spans and attributes for observing
distributed or multi-step operations.

Relevance:

- Cache entries should record what produced them.
- Query results should be explainable: source file, section, content hash,
  index backend, policy decision, and transform chain.
- Agent context packs should be auditable.

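An explainable query result could carry the listed fields directly, as in the sketch below. `QueryMatch`, `explain`, and all sample values are assumptions for illustration; the sketch does not use the W3C PROV or OpenTelemetry data models, only a plain record.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryMatch:
    """Hypothetical result record carrying its own provenance."""
    source_file: str
    section: str
    content_hash: str
    index_backend: str
    policy_decision: str
    transform_chain: tuple[str, ...]
    text: str


def explain(match: QueryMatch) -> str:
    """Render a one-line, human-readable account of where a match came from."""
    chain = " -> ".join(match.transform_chain) or "(none)"
    return (
        f"{match.source_file}#{match.section} "
        f"hash={match.content_hash[:12]} backend={match.index_backend} "
        f"policy={match.policy_decision} transforms={chain}"
    )


body = "Cache entries should record what produced them."
match = QueryMatch(
    source_file="notes/cache.md",
    section="provenance",
    content_hash=hashlib.sha256(body.encode("utf-8")).hexdigest(),
    index_backend="sqlite-fts5",
    policy_decision="allow",
    transform_chain=("parse", "section-split"),
    text=body,
)
summary = explain(match)
```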
Sources:

- https://www.w3.org/TR/prov-overview/
- https://opentelemetry.io/docs/concepts/signals/traces/

### Access Control: Fluid To Rigid

Zanzibar demonstrates a relationship-based authorization model at large scale.
OpenFGA and SpiceDB make Zanzibar-style relationship-based access control
available as productized systems. OPA/Rego and Cedar provide policy evaluation
models for attribute- and rule-based decisions.

Relevance:

- Markitect should support a fluid-to-rigid access-control ladder.
- Local labs can start with labels and trust scopes.
- Secure deployments need policy checks before query results are returned to
  agents or users.
- Policy decisions should be part of the diagnostic and provenance trail.

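The fluid end of the ladder, label checks applied before results leave the backend, might be sketched like this. The types `Result` and `Decision`, the function `filter_results`, and the label names are hypothetical; a stricter deployment would presumably delegate the decision to a system such as those named above rather than compare label sets locally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    section: str
    labels: frozenset[str]   # labels a caller must hold to see this result


@dataclass(frozen=True)
class Decision:
    section: str
    allowed: bool
    reason: str


def filter_results(
    results: list[Result], granted: frozenset[str]
) -> tuple[list[Result], list[Decision]]:
    """Return only permitted results, plus a decision log for the audit trail."""
    allowed: list[Result] = []
    log: list[Decision] = []
    for result in results:
        missing = result.labels - granted
        ok = not missing
        reason = "all labels granted" if ok else "missing: " + ", ".join(sorted(missing))
        log.append(Decision(result.section, ok, reason))
        if ok:
            allowed.append(result)
    return allowed, log


results = [
    Result("public-overview", frozenset()),
    Result("team-notes", frozenset({"team"})),
    Result("incident-report", frozenset({"team", "security"})),
]
visible, decisions = filter_results(results, frozenset({"team"}))
```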
Sources:

- https://www.usenix.org/conference/atc19/presentation/pang
- https://openfga.dev/docs/concepts
- https://www.openpolicyagent.org/docs/policy-language
- https://docs.cedarpolicy.com/

## Main Finding

The optional backend should be a **capability-oriented cache fabric**, not a
single database choice.

The slim core should continue to parse, validate, query, transform, and
generate Markdown without persistent infrastructure. The research-lab backend
should attach through explicit interfaces:

- content-addressed snapshots
- index manifests
- query adapter registry
- memory/context package registry
- access policy gateway
- provenance and trace records

That lets the project support spontaneous one-time tool use and also grow into
high-performance, agentic, security-sensitive knowledge systems.

## Most Promising Use Cases

### UC-RL-001: AST Introspection And JSONPath Backend

Expose raw parsed documents for advanced users:

- `mkt ast show`
- `mkt ast query --backend jsonpath`
- raw token and inline query support
- adapter path from simple selectors to JSONPath where possible

Utility:

- debugging parser behavior
- developing transforms
- power-user structural extraction
- migration path for legacy `markitect-main` AST workflows

### UC-RL-002: Local Persistent Knowledge Index

Build a local cache/index for a repo or document collection:

- content-addressed document snapshots
- SQLite JSON tables for structure
- SQLite FTS5 for section/block text search
- optional DuckDB/Arrow export for analytical work
- incremental refresh based on content hashes

Utility:

- fast repeated queries
- search across many Markdown files
- offline/local-first knowledge work
- foundation for batch transforms and generation pipelines

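Incremental refresh based on content hashes can be reduced to a comparison between the hashes already recorded in the index and the hashes of the current files. The sketch below assumes a hypothetical `plan_refresh` helper and in-memory dictionaries in place of a real index and file system.

```python
import hashlib


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_refresh(indexed: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare stored hashes (path -> hash) with current texts (path -> text).

    Only paths listed under "added" or "changed" need to be re-parsed.
    """
    current_hashes = {path: content_hash(text) for path, text in current.items()}
    return {
        "added": sorted(p for p in current_hashes if p not in indexed),
        "changed": sorted(
            p for p, h in current_hashes.items() if p in indexed and indexed[p] != h
        ),
        "removed": sorted(p for p in indexed if p not in current_hashes),
    }


# Hypothetical state: the index knows a.md and b.md; b.md was edited,
# a.md was left alone, c.md is new, and old.md was deleted.
indexed = {
    "a.md": content_hash("# A\n"),
    "b.md": content_hash("# B\n"),
    "old.md": content_hash("# Old\n"),
}
current = {"a.md": "# A\n", "b.md": "# B (edited)\n", "c.md": "# C\n"}
plan = plan_refresh(indexed, current)
```

Unchanged files such as `a.md` appear in no bucket, which is what keeps repeated refreshes cheap.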
### UC-RL-003: Agent Working Memory Cache

Create activatable context packages for LLM agents:

- namespace-scoped memories
- short-term working sets and long-term caches
- semantic/episodic/procedural memory categories
- drop/reactivate by stable id
- token-budget-aware context assembly
- provenance and freshness metadata

Utility:

- efficient agent work across long projects
- reusable context packs for recurring tasks
- controlled memory updates and recall
- bridge from Markitect documents to agent infrastructure

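Token-budget-aware context assembly could start as a simple greedy selection, sketched here. The four-characters-per-token estimate is a crude assumption standing in for a real tokenizer, and `assemble_context` and its priority scores are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly four characters per token.
    # A real tokenizer would replace this estimate.
    return max(1, len(text) // 4)


def assemble_context(chunks: list[tuple[float, str]], budget: int) -> list[str]:
    """Greedily pick the highest-priority chunks that still fit the budget."""
    selected: list[str] = []
    remaining = budget
    for _priority, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = estimate_tokens(text)
        if cost <= remaining:
            selected.append(text)
            remaining -= cost
    return selected


chunks = [
    (0.9, "a" * 40),  # estimated cost 10
    (0.5, "b" * 80),  # estimated cost 20, does not fit
    (0.7, "c" * 20),  # estimated cost 5
]
picked = assemble_context(chunks, budget=16)
```

Greedy selection is not optimal in general, but it is predictable and easy to explain in a diagnostic trail, which may matter more here than packing efficiency.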
### UC-RL-004: Access-Controlled Knowledge Gateway

Add policy enforcement to cached retrieval:

- labels/trust zones for local use
- ACL/ReBAC/ABAC adapters for stricter systems
- policy-aware query result filtering
- decision logs and diagnostics
- secure context packages for LLM use

Utility:

- enterprise and IT-security use cases
- multi-tenant knowledge bases
- agent access control
- auditable data exposure

## Design Principles

- The core remains infrastructure-free.
- Backends are optional and capability-declared.
- Every cached object is content-addressed or provenance-addressed.
- Query adapters return the same match/result envelope.
- Policy is checked before data leaves a backend boundary.
- Context packages are explicit, droppable, and reactivatable.
- LLM memory is data with provenance, not invisible prompt residue.
- Experimental backends belong behind stable contracts.