Workplan dependencies and prio for text research lab workplans
docs/research-lab-cache-backend-research.md
@@ -0,0 +1,248 @@

# Research Lab: Sophisticated Cache Backends

Date: 2026-05-03

## Purpose

This research note explores how `markitect-tool` can keep its slim,
markdown-native core while allowing sophisticated optional backends for cached
ASTs, structured indexes, multiple query paradigms, agent working memory, and
access-controlled knowledge systems.

The goal is not to rebuild `markitect-main` wholesale. The goal is to preserve
the useful insight behind it: once Markdown has been parsed into a trustworthy
structure, many higher-value operations become possible if that structure can
be cached, indexed, queried, reactivated, and governed.

## Research Signals

### Content Addressing And Reproducibility

Git's object model is a practical reference for content-addressed storage:
content is written to an object database and retrieved by a hash-derived key.
Bazel remote caching similarly keeps action result metadata (the action cache)
separate from the output files themselves (a content-addressable store), so
work can be reused when inputs are unchanged.

Relevance:

- Parse results should be keyed by content hash, parser version, and options.
- Derived indexes should declare their input snapshots and invalidation rules.
- Reproducible context packages need stable object identities.
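
The first point above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name `parse_cache_key`, the choice of SHA-256, and the canonical-JSON encoding of options are all assumptions.

```python
import hashlib
import json


def parse_cache_key(content: bytes, parser_version: str, options: dict) -> str:
    """Derive a stable cache key from content hash, parser version, and options.

    Hypothetical helper: the name and key layout are assumptions.
    """
    content_hash = hashlib.sha256(content).hexdigest()
    # Canonical JSON so that the ordering of option keys does not change the key.
    canonical_options = json.dumps(options, sort_keys=True, separators=(",", ":"))
    material = "\n".join([content_hash, parser_version, canonical_options])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


key = parse_cache_key(b"# Title\n", "1.2.0", {"gfm": True, "front_matter": False})
```

Because the parser version and options are part of the key, upgrading the parser or changing an option produces a new key instead of silently reusing a stale parse.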

Sources:

- https://git-scm.com/book/en/v2/Git-Internals-Git-Objects
- https://docs.bazel.build/versions/main/remote-caching.html

### Structured Query And AST Introspection

JSONPath is now standardized as RFC 9535. It defines selection and extraction
over JSON values and has security considerations around implementation behavior
and query construction. This makes it a good optional backend for power users
who need raw access to the full parsed structure.

SQLite JSON and FTS5 provide a pragmatic local storage/query foundation. FTS5
supports full-text search, relevance ranking, phrase/prefix/NEAR queries, and
external-content tables. These features map well to Markdown sections and
blocks while keeping local-first operation.

Relevance:

- Keep the current simple selector API as the common surface.
- Add JSONPath over `Document.to_dict()` as an optional advanced adapter.
- Add SQLite as the first local persistent index backend.
- Keep AST introspection as a debugging and research-lab capability, not as
  the default user interface.
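
As a rough sketch of how sections might map onto FTS5, the example below indexes one row per section using Python's standard `sqlite3` module. The table name `sections`, its columns, and the helper names are assumptions, and it requires an SQLite build with FTS5 enabled.

```python
import sqlite3


def build_index(rows):
    """Create an in-memory FTS5 table with one row per Markdown section."""
    conn = sqlite3.connect(":memory:")
    # content_hash is stored for provenance but excluded from the text index.
    conn.execute(
        "CREATE VIRTUAL TABLE sections "
        "USING fts5(path, heading, body, content_hash UNINDEXED)"
    )
    conn.executemany(
        "INSERT INTO sections (path, heading, body, content_hash) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    return conn


def search(conn, query, limit=5):
    """Run an FTS5 query; ORDER BY rank sorts best matches first."""
    sql = (
        "SELECT path, heading, content_hash FROM sections "
        "WHERE sections MATCH ? ORDER BY rank LIMIT ?"
    )
    return conn.execute(sql, (query, limit)).fetchall()


conn = build_index([
    ("docs/a.md", "Purpose",
     "Keep the slim core while allowing optional cache backends.", "hash-a"),
    ("docs/b.md", "Access Control",
     "Policy checks run before query results are returned.", "hash-b"),
])
prefix_hits = search(conn, "cache*")            # prefix query
phrase_hits = search(conn, '"query results"')   # phrase query
```

Returning the content hash with every hit keeps search results tied to a specific snapshot of the source section.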

Sources:

- https://www.rfc-editor.org/rfc/rfc9535.html
- https://www.sqlite.org/json1.html
- https://www.sqlite.org/fts5.html

### Columnar And Vector Backends

Apache Arrow defines a language-independent columnar memory format. DuckDB is
strong for local analytical SQL over structured data. Vector databases such as
Qdrant, LanceDB, and pgvector provide semantic retrieval primitives.

Relevance:

- The core should not depend on any vector database.
- Index backends should advertise capabilities: text search, SQL, JSONPath,
  vector search, hybrid retrieval, analytical scans.
- Vector indexes should store provenance back to document, section, and content
  hash, not just opaque chunks.
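
One way to express capability advertisement is a flag set per backend, as sketched below. The names `Capability`, `BackendInfo`, and `backends_supporting` are hypothetical; only the list of capabilities comes from the note.

```python
from dataclasses import dataclass
from enum import Flag, auto


class Capability(Flag):
    """Capabilities a backend may advertise (hypothetical enum)."""
    TEXT_SEARCH = auto()
    SQL = auto()
    JSONPATH = auto()
    VECTOR_SEARCH = auto()
    HYBRID_RETRIEVAL = auto()
    ANALYTICAL_SCAN = auto()


@dataclass(frozen=True)
class BackendInfo:
    name: str
    capabilities: Capability


def backends_supporting(backends, needed: Capability) -> list:
    """Return names of backends that advertise every capability in `needed`."""
    return [b.name for b in backends if (b.capabilities & needed) == needed]


registered = [
    BackendInfo("sqlite", Capability.TEXT_SEARCH | Capability.SQL),
    BackendInfo("duckdb", Capability.SQL | Capability.ANALYTICAL_SCAN),
    BackendInfo("vector", Capability.VECTOR_SEARCH),
]
sql_text = backends_supporting(registered, Capability.SQL | Capability.TEXT_SEARCH)
```

With this shape the core can ask for a capability rather than a product name, so no vector database becomes a hard dependency.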

Sources:

- https://arrow.apache.org/docs/format/Columnar.html
- https://duckdb.org/docs/stable/data/json/overview
- https://qdrant.tech/documentation/manage-data/collections/
- https://docs.lancedb.com/
- https://github.com/pgvector/pgvector

### Agent Context And Working Memory

The Model Context Protocol gives a useful integration model: resources provide
context/data, tools execute actions, and roots define filesystem or URI
boundaries. LangChain/LangGraph memory docs distinguish short-term,
thread-scoped memory from long-term, namespace-scoped memory, and further split
long-term memory into semantic, episodic, and procedural forms. The MemGPT
paper frames memory management as an operating-system-like problem for LLMs.

Relevance:

- Markitect context caches should be namespace-scoped and explicitly
  activatable.
- A context package should carry text, structure, provenance, policy, freshness,
  and token-budget metadata.
- Agents should be able to drop and reactivate working context by stable id.
- Memory writes need two modes: hot-path (written during the agent's turn) and
  background (written asynchronously afterwards).
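
The drop/reactivate idea can be sketched with a small in-memory registry. Everything here is an assumption made for illustration: the `ContextPackage` fields are a subset of the metadata listed above, and `ContextRegistry` is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextPackage:
    """Hypothetical context package; fields are illustrative only."""
    package_id: str
    namespace: str
    text: str
    provenance: tuple
    token_estimate: int


class ContextRegistry:
    """Stores packages by (namespace, id) and tracks which ones are active."""

    def __init__(self):
        self._stored = {}
        self._active = set()

    def register(self, pkg: ContextPackage) -> None:
        self._stored[(pkg.namespace, pkg.package_id)] = pkg

    def activate(self, namespace: str, package_id: str) -> ContextPackage:
        key = (namespace, package_id)
        pkg = self._stored[key]  # raises KeyError for unknown ids
        self._active.add(key)
        return pkg

    def drop(self, namespace: str, package_id: str) -> None:
        # Dropping only deactivates; the stored package can be reactivated.
        self._active.discard((namespace, package_id))

    def active_ids(self, namespace: str) -> list:
        return sorted(pid for ns, pid in self._active if ns == namespace)


registry = ContextRegistry()
registry.register(
    ContextPackage("ctx-1", "project-a", "Purpose section text",
                   ("docs/a.md#purpose",), 120)
)
registry.activate("project-a", "ctx-1")
registry.drop("project-a", "ctx-1")
registry.activate("project-a", "ctx-1")  # reactivation by stable id
```

Keeping storage separate from the active set is what makes a package droppable without being lost.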

Sources:

- https://modelcontextprotocol.io/specification/2025-06-18
- https://docs.langchain.com/oss/python/concepts/memory
- https://developers.llamaindex.ai/python/framework/module_guides/deploying/agents/memory/
- https://arxiv.org/abs/2310.08560

### Provenance, Observability, And Debuggability

W3C PROV provides a vocabulary for entities, activities, agents, and
derivations. OpenTelemetry traces provide spans and attributes for observing
distributed or multi-step operations.

Relevance:

- Cache entries should record what produced them.
- Query results should be explainable: source file, section, content hash,
  index backend, policy decision, and transform chain.
- Agent context packs should be auditable.
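
A possible shape for an explainable query result is sketched below. The `Explanation` class, its field names, and the `markitect.` attribute prefix are assumptions; the flat string dictionary is simply a form that could be attached to a trace span as attributes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Explanation:
    """Hypothetical record explaining where a query result came from."""
    source_file: str
    section: str
    content_hash: str
    index_backend: str
    policy_decision: str
    transform_chain: tuple = ()

    def as_trace_attributes(self) -> dict:
        """Flatten into string key/value pairs (assumed attribute names)."""
        return {
            "markitect.source_file": self.source_file,
            "markitect.section": self.section,
            "markitect.content_hash": self.content_hash,
            "markitect.index_backend": self.index_backend,
            "markitect.policy_decision": self.policy_decision,
            "markitect.transform_chain": " > ".join(self.transform_chain),
        }


explanation = Explanation(
    source_file="docs/a.md",
    section="Purpose",
    content_hash="hash-a",
    index_backend="sqlite-fts5",
    policy_decision="allow",
    transform_chain=("parse", "section-split"),
)
attributes = explanation.as_trace_attributes()
```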

Sources:

- https://www.w3.org/TR/prov-overview/
- https://opentelemetry.io/docs/concepts/signals/traces/

### Access Control: Fluid To Rigid

Zanzibar demonstrates a relationship-based authorization model at large scale.
OpenFGA and SpiceDB make Zanzibar-style relationship-based access control
available as productized systems. OPA/Rego and Cedar provide policy evaluation
models for attribute-based and rule-based decisions.

Relevance:

- Markitect should support a fluid-to-rigid access-control ladder.
- Local labs can start with labels and trust scopes.
- Secure deployments need policy checks before query results are returned to
  agents or users.
- Policy decisions should be part of the diagnostic and provenance trail.
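
The "fluid" end of the ladder, labels checked before results leave the backend, might look like the sketch below. The result shape (dicts with `content_hash` and `labels`), the `Decision` record, and `filter_results` are hypothetical; a stricter deployment would replace the label check with a call to an external policy engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    """One logged policy decision (hypothetical shape)."""
    content_hash: str
    allowed: bool
    reason: str


def filter_results(results, caller_labels, decision_log):
    """Return only results whose required labels the caller holds.

    Every decision, allow or deny, is appended to `decision_log`.
    """
    allowed = []
    for result in results:
        missing = set(result["labels"]) - set(caller_labels)
        ok = not missing
        reason = "ok" if ok else "missing labels: " + ", ".join(sorted(missing))
        decision_log.append(Decision(result["content_hash"], ok, reason))
        if ok:
            allowed.append(result)
    return allowed


log = []
results = [
    {"content_hash": "hash-a", "labels": {"public"}},
    {"content_hash": "hash-b", "labels": {"public", "security"}},
]
visible = filter_results(results, {"public"}, log)
```

Logging denials as well as allows is what lets the decision trail double as a diagnostic.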

Sources:

- https://www.usenix.org/conference/atc19/presentation/pang
- https://openfga.dev/docs/concepts
- https://www.openpolicyagent.org/docs/policy-language
- https://docs.cedarpolicy.com/

## Main Finding

The optional backend should be a **capability-oriented cache fabric**, not a
single database choice.

The slim core should continue to parse, validate, query, transform, and
generate Markdown without persistent infrastructure. The research-lab backend
should attach through explicit interfaces:

- content-addressed snapshots
- index manifests
- query adapter registry
- memory/context package registry
- access policy gateway
- provenance and trace records

That lets the project support spontaneous one-time tool use and also grow into
high-performance, agentic, security-sensitive knowledge systems.

## Most Promising Use Cases

### UC-RL-001: AST Introspection And JSONPath Backend

Expose raw parsed documents for advanced users:

- `mkt ast show`
- `mkt ast query --backend jsonpath`
- raw token and inline query support
- adapter path from simple selectors to JSONPath where possible

Utility:

- debugging parser behavior
- developing transforms
- power-user structural extraction
- migration path for legacy `markitect-main` AST workflows
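
The adapter path from simple selectors to JSONPath could start as a string translation, sketched below. The selector syntax `type[key=value,...]` and the assumption that each node in `Document.to_dict()` carries a `type` field are both hypothetical; only the emitted JSONPath filter syntax follows RFC 9535.

```python
import re

# Hypothetical selector grammar: a node type, optionally followed by
# comma-separated key=value attribute tests in square brackets.
_SELECTOR = re.compile(r"^(?P<type>[a-z_]+)(?:\[(?P<attrs>[^\]]*)\])?$")


def _literal(value: str) -> str:
    """Render a value as a JSONPath number or an escaped string literal."""
    if value.isdigit():
        return value
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def selector_to_jsonpath(selector: str) -> str:
    """Translate a simple selector into an RFC 9535 descendant filter query."""
    match = _SELECTOR.match(selector.strip())
    if match is None:
        raise ValueError("unsupported selector: " + repr(selector))
    clauses = ["@.type == " + _literal(match.group("type"))]
    attrs = match.group("attrs")
    if attrs:
        for pair in attrs.split(","):
            key, sep, value = pair.partition("=")
            key = key.strip()
            if not sep or not key.isidentifier():
                raise ValueError("bad attribute test: " + repr(pair))
            clauses.append("@." + key + " == " + _literal(value.strip()))
    return "$..[?" + " && ".join(clauses) + "]"


query = selector_to_jsonpath("heading[level=2]")
```

Escaping values instead of splicing them in raw is a small answer to the query-construction concern that RFC 9535 raises.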

### UC-RL-002: Local Persistent Knowledge Index

Build a local cache/index for a repo or document collection:

- content-addressed document snapshots
- SQLite JSON tables for structure
- SQLite FTS5 for section/block text search
- optional DuckDB/Arrow export for analytical work
- incremental refresh based on content hashes

Utility:

- fast repeated queries
- search across many Markdown files
- offline/local-first knowledge work
- foundation for batch transforms and generation pipelines
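
Incremental refresh based on content hashes can be reduced to comparing two hash maps. The sketch below assumes documents arrive as a path-to-bytes mapping and that the index remembers one SHA-256 hash per path; `plan_refresh` and its return shape are hypothetical.

```python
import hashlib


def plan_refresh(documents: dict, indexed_hashes: dict) -> dict:
    """Decide which paths to re-parse and which to remove from the index.

    documents:      path -> current file bytes
    indexed_hashes: path -> content hash recorded at last indexing
    """
    current = {
        path: hashlib.sha256(data).hexdigest()
        for path, data in documents.items()
    }
    # New or modified files: the hash is missing from the index or differs.
    reparse = sorted(p for p, h in current.items() if indexed_hashes.get(p) != h)
    # Files that were indexed earlier but no longer exist.
    remove = sorted(p for p in indexed_hashes if p not in current)
    return {"reparse": reparse, "remove": remove, "hashes": current}


old_hash = hashlib.sha256(b"# A\n").hexdigest()
plan = plan_refresh(
    documents={"a.md": b"# A\n", "b.md": b"# B changed\n", "c.md": b"# C\n"},
    indexed_hashes={"a.md": old_hash, "b.md": "stale-hash", "gone.md": "x"},
)
```

Unchanged files never reach the parser, which is where the speedup for repeated queries over large collections would come from.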

### UC-RL-003: Agent Working Memory Cache

Create activatable context packages for LLM agents:

- namespace-scoped memories
- short-term working sets and long-term caches
- semantic/episodic/procedural memory categories
- drop/reactivate by stable id
- token-budget-aware context assembly
- provenance and freshness metadata

Utility:

- efficient agent work across long projects
- reusable context packs for recurring tasks
- controlled memory updates and recall
- bridge from Markitect documents to agent infrastructure
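
Token-budget-aware assembly could begin as a greedy selection, sketched below. The `Candidate` fields, the notion of a numeric priority, and the greedy strategy itself are assumptions; a real implementation might weigh freshness or relevance scores instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate package for inclusion in an agent's context."""
    package_id: str
    token_estimate: int
    priority: int


def assemble_context(candidates, token_budget: int):
    """Pick packages by descending priority, skipping any that do not fit."""
    chosen = []
    used = 0
    for candidate in sorted(candidates, key=lambda c: c.priority, reverse=True):
        if used + candidate.token_estimate <= token_budget:
            chosen.append(candidate.package_id)
            used += candidate.token_estimate
    return chosen, used


chosen, used = assemble_context(
    [
        Candidate("design-notes", 600, priority=3),
        Candidate("api-reference", 500, priority=2),
        Candidate("glossary", 300, priority=1),
    ],
    token_budget=1000,
)
```

Here `api-reference` is skipped because it would exceed the budget, while the smaller `glossary` still fits.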

### UC-RL-004: Access-Controlled Knowledge Gateway

Add policy enforcement to cached retrieval:

- labels/trust zones for local use
- ACL/ReBAC/ABAC adapters for stricter systems
- policy-aware query result filtering
- decision logs and diagnostics
- secure context packages for LLM use

Utility:

- enterprise and IT-security use cases
- multi-tenant knowledge bases
- agent access control
- auditable data exposure

## Design Principles

- The core remains infrastructure-free.
- Backends are optional and capability-declared.
- Every cached object is content-addressed or provenance-addressed.
- Query adapters return the same match/result envelope.
- Policy is checked before data leaves a backend boundary.
- Context packages are explicit, droppable, and reactivatable.
- LLM memory is data with provenance, not invisible prompt residue.
- Experimental backends belong behind stable contracts.