Workplan dependencies and prio for text research lab workplans

2026-05-04 00:12:07 +02:00
parent 4fc891c076
commit 6f0facd744
18 changed files with 1644 additions and 1 deletions
--- a/docs/research-lab-cache-backend-research.md
+++ b/docs/research-lab-cache-backend-research.md
@@ -0,0 +1,248 @@
+# Research Lab: Sophisticated Cache Backends
+
+Date: 2026-05-03
+
+## Purpose
+
+This research note explores how `markitect-tool` can keep its slim,
+markdown-native core while allowing sophisticated optional backends for cached
+ASTs, structured indexes, multiple query paradigms, agent working memory, and
+access-controlled knowledge systems.
+
+The goal is not to rebuild `markitect-main` wholesale. The goal is to preserve
+the useful insight behind it: once Markdown has been parsed into a trustworthy
+structure, many higher-value operations become possible if that structure can
+be cached, indexed, queried, reactivated, and governed.
+
+## Research Signals
+
+### Content Addressing And Reproducibility
+
+Git's object model is a practical reference for content-addressed storage:
+content is written to an object database and retrieved by a hash-derived key.
+Bazel remote caching similarly pairs an action cache (result metadata keyed by
+action hash) with a content-addressable store of output files, so work can be
+reused when inputs are unchanged.
+
+Relevance:
+
+- Parse results should be keyed by content hash, parser version, and options.
+- Derived indexes should declare their input snapshots and invalidation rules.
+- Reproducible context packages need stable object identities.
+
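The first point above can be sketched as a key derivation. This is a minimal illustration, not the project's actual key scheme: the helper name `parse_cache_key`, the choice of SHA-256, and the option names in the usage note are all assumptions made for the example.

```python
import hashlib
import json


def parse_cache_key(content: bytes, parser_version: str, options: dict) -> str:
    """Derive a stable cache key for a parse result (hypothetical helper)."""
    # Canonical JSON so that the ordering of options does not change the key.
    canonical_options = json.dumps(options, sort_keys=True, separators=(",", ":"))
    hasher = hashlib.sha256()
    for part in (
        content,
        parser_version.encode("utf-8"),
        canonical_options.encode("utf-8"),
    ):
        # Length-prefix each part so field boundaries stay unambiguous.
        hasher.update(len(part).to_bytes(8, "big"))
        hasher.update(part)
    return hasher.hexdigest()
```

Calling `parse_cache_key(b"# Title\n", "1.0.0", {"gfm": True})` twice yields the same key, while bumping the parser version or changing an option yields a different one, which is the invalidation behavior the bullet asks for.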
+Sources:
+
+- https://git-scm.com/book/en/v2/Git-Internals-Git-Objects
+- https://docs.bazel.build/versions/main/remote-caching.html
+
+### Structured Query And AST Introspection
+
+JSONPath is now standardized as RFC 9535, which defines selection and
+extraction over JSON values. A stable specification makes it a good optional
+backend for power users who need raw access to the full parsed structure. The
+RFC's security considerations around implementation behavior and query
+construction apply to any adapter built on it.
+
+SQLite JSON and FTS5 provide a pragmatic local storage/query foundation. FTS5
+supports full-text search, relevance ranking, phrase/prefix/NEAR queries, and
+external-content tables. These features map well to Markdown sections and
+blocks while keeping local-first operation.
+
+Relevance:
+
+- Keep the current simple selector API as the common surface.
+- Add JSONPath over `Document.to_dict()` as an optional advanced adapter.
+- Add SQLite as the first local persistent index backend.
+- Keep AST introspection as a debugging and research-lab capability, not as
+  the default user interface.
+
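As a rough sketch of the SQLite point, a section-level FTS5 index can be built with Python's standard `sqlite3` module. The table name `sections`, its columns, and both helper functions are hypothetical, and the example assumes the linked SQLite library was compiled with FTS5 enabled.

```python
import sqlite3


def build_section_index(sections):
    """Index (path, heading, body) tuples in an in-memory FTS5 table."""
    conn = sqlite3.connect(":memory:")
    # `path` is stored for provenance but excluded from the full-text index.
    conn.execute(
        "CREATE VIRTUAL TABLE sections USING fts5(path UNINDEXED, heading, body)"
    )
    conn.executemany(
        "INSERT INTO sections (path, heading, body) VALUES (?, ?, ?)", sections
    )
    return conn


def search_sections(conn, query, limit=5):
    """Run an FTS5 query and return (path, heading) pairs, best match first."""
    rows = conn.execute(
        "SELECT path, heading FROM sections "
        "WHERE sections MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()
```

Phrase queries such as `'"content hash"'` and prefix queries such as `'trust*'` go through the same `search_sections` call, so the simple selector API can stay the common surface while FTS5 syntax remains available to advanced users.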
+Sources:
+
+- https://www.rfc-editor.org/rfc/rfc9535.html
+- https://www.sqlite.org/json1.html
+- https://www.sqlite.org/fts5.html
+
+### Columnar And Vector Backends
+
+Apache Arrow defines a language-independent columnar memory format. DuckDB is
+strong for local analytical SQL over structured data. Vector databases such as
+Qdrant, LanceDB, and pgvector provide semantic retrieval primitives.
+
+Relevance:
+
+- The core should not depend on any vector database.
+- Index backends should advertise capabilities: text search, SQL, JSONPath,
+  vector search, hybrid retrieval, analytical scans.
+- Vector indexes should store provenance back to document, section, and content
+  hash, not just opaque chunks.
+
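Capability advertisement could look roughly like the sketch below. The `Capability` flag set mirrors the list in the second bullet; `BackendInfo`, `backends_supporting`, and the backend names in the usage note are illustrative assumptions, not an existing `markitect-tool` API.

```python
from dataclasses import dataclass
from enum import Flag, auto


class Capability(Flag):
    TEXT_SEARCH = auto()
    SQL = auto()
    JSONPATH = auto()
    VECTOR_SEARCH = auto()
    HYBRID_RETRIEVAL = auto()
    ANALYTICAL_SCAN = auto()


@dataclass(frozen=True)
class BackendInfo:
    """What a backend declares about itself (hypothetical manifest)."""
    name: str
    capabilities: Capability


def backends_supporting(backends, required):
    """Return the names of backends that offer every required capability."""
    return [b.name for b in backends if (b.capabilities & required) == required]
```

With a registry holding a `sqlite` backend (`TEXT_SEARCH | SQL`) and a `duckdb` backend (`SQL | ANALYTICAL_SCAN`), asking for `Capability.SQL` returns both, while asking for `Capability.VECTOR_SEARCH` returns nothing, so the core never needs to import a vector database to answer the question.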
+Sources:
+
+- https://arrow.apache.org/docs/format/Columnar.html
+- https://duckdb.org/docs/stable/data/json/overview
+- https://qdrant.tech/documentation/manage-data/collections/
+- https://docs.lancedb.com/
+- https://github.com/pgvector/pgvector
+
+### Agent Context And Working Memory
+
+The Model Context Protocol gives a useful integration model: resources provide
+context/data, tools execute actions, and roots define filesystem or URI
+boundaries. LangChain/LangGraph memory docs distinguish short-term,
+thread-scoped memory from long-term, namespace-scoped memory, and further split
+long-term memory into semantic, episodic, and procedural forms. The MemGPT
+paper frames memory management as an operating-system-like problem for LLMs.
+
+Relevance:
+
+- Markitect context caches should be namespace-scoped and explicitly
+  activatable.
+- A context package should carry text, structure, provenance, policy, freshness,
+  and token-budget metadata.
+- Agents should be able to drop and reactivate working context by stable id.
+- Memory writes need both a hot-path mode (written during the agent turn) and
+  a background mode (written asynchronously afterwards).
+
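One possible shape for an activatable context package is sketched below. `ContextPackage`, `ContextRegistry`, and their fields are hypothetical names chosen for the example; policy and freshness metadata are left out to keep it short, and token counts are assumed to be precomputed estimates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextPackage:
    package_id: str        # stable id used to drop and reactivate
    namespace: str
    text: str
    source_hashes: tuple   # provenance: content hashes of the source documents
    token_estimate: int


class ContextRegistry:
    """Keeps packages stored even while they are not active."""

    def __init__(self):
        self._stored = {}
        self._active = set()

    def register(self, package):
        self._stored[package.package_id] = package

    def activate(self, package_id):
        if package_id not in self._stored:
            raise KeyError(package_id)
        self._active.add(package_id)

    def drop(self, package_id):
        # Dropping only deactivates; the package can be reactivated later.
        self._active.discard(package_id)

    def assemble(self, namespace, token_budget):
        """Pick active packages from one namespace that fit the token budget."""
        chosen, used = [], 0
        for package_id in sorted(self._active):
            package = self._stored[package_id]
            if package.namespace != namespace:
                continue
            if used + package.token_estimate > token_budget:
                continue
            chosen.append(package)
            used += package.token_estimate
        return chosen
```

Because `drop` leaves the package in storage, an agent can release working context under token pressure and later call `activate` with the same id to get it back unchanged.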
+Sources:
+
+- https://modelcontextprotocol.io/specification/2025-06-18
+- https://docs.langchain.com/oss/python/concepts/memory
+- https://developers.llamaindex.ai/python/framework/module_guides/deploying/agents/memory/
+- https://arxiv.org/abs/2310.08560
+
+### Provenance, Observability, And Debuggability
+
+W3C PROV provides a vocabulary for entities, activities, agents, and
+derivations. OpenTelemetry traces provide spans and attributes for observing
+distributed or multi-step operations.
+
+Relevance:
+
+- Cache entries should record what produced them.
+- Query results should be explainable: source file, section, content hash,
+  index backend, policy decision, and transform chain.
+- Agent context packs should be auditable.
+
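The explainability fields listed above map naturally onto a small record attached to each query result. The sketch below is an assumption about shape only: `ResultProvenance` and its `explain` method are hypothetical and do not implement the W3C PROV vocabulary or emit OpenTelemetry spans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResultProvenance:
    """Explains where one query result came from (hypothetical record)."""
    source_file: str
    section: str
    content_hash: str
    index_backend: str
    policy_decision: str
    transform_chain: tuple = ()

    def explain(self) -> str:
        # Render a one-line, human-readable audit trail for diagnostics.
        chain = " -> ".join(self.transform_chain) or "none"
        return (
            f"{self.source_file}#{self.section} "
            f"(hash {self.content_hash[:12]}, backend {self.index_backend}, "
            f"policy {self.policy_decision}, transforms {chain})"
        )
```

Keeping the record immutable means it can be stored alongside a cache entry or context pack and audited later without worrying that it was edited after the fact.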
+Sources:
+
+- https://www.w3.org/TR/prov-overview/
+- https://opentelemetry.io/docs/concepts/signals/traces/
+
+### Access Control: Fluid To Rigid
+
+Zanzibar demonstrates a relationship-based authorization model at large scale.
+OpenFGA and SpiceDB make Zanzibar-style relationship-based access control
+available as productized systems. OPA/Rego and Cedar provide policy evaluation
+models for attribute and rule-based decisions.
+
+Relevance:
+
+- Markitect should support a fluid-to-rigid access-control ladder.
+- Local labs can start with labels and trust scopes.
+- Secure deployments need policy checks before query results are returned to
+  agents or users.
+- Policy decisions should be part of the diagnostic and provenance trail.
+
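At the fluid end of the ladder, a label check with a decision log might be as small as the following sketch. `Match`, `Decision`, and `filter_by_labels` are hypothetical names, and the label values in the usage note are invented; a rigid deployment would replace the set-membership test with a call to a real policy engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    path: str
    section: str
    label: str


@dataclass(frozen=True)
class Decision:
    path: str
    section: str
    allowed: bool
    reason: str


def filter_by_labels(matches, allowed_labels):
    """Release only matches whose label is in scope; log every decision."""
    released, decisions = [], []
    for match in matches:
        allowed = match.label in allowed_labels
        verdict = "is" if allowed else "is not"
        reason = f"label '{match.label}' {verdict} in caller scope"
        decisions.append(Decision(match.path, match.section, allowed, reason))
        if allowed:
            released.append(match)
    return released, decisions
```

Given matches labeled `public` and `internal` and a caller scope of `{"public"}`, only the `public` match is released, but both decisions are recorded, so denials show up in the diagnostic trail rather than disappearing silently.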
+Sources:
+
+- https://www.usenix.org/conference/atc19/presentation/pang
+- https://openfga.dev/docs/concepts
+- https://www.openpolicyagent.org/docs/policy-language
+- https://docs.cedarpolicy.com/
+
+## Main Finding
+
+The optional backend should be a **capability-oriented cache fabric**, not a
+single database choice.
+
+The slim core should continue to parse, validate, query, transform, and
+generate Markdown without persistent infrastructure. The research-lab backend
+should attach through explicit interfaces:
+
+- content-addressed snapshots
+- index manifests
+- query adapter registry
+- memory/context package registry
+- access policy gateway
+- provenance and trace records
+
+That lets the project support spontaneous one-time tool use and also grow into
+high-performance, agentic, security-sensitive knowledge systems.
+
+## Most Promising Use Cases
+
+### UC-RL-001: AST Introspection And JSONPath Backend
+
+Expose raw parsed documents for advanced users:
+
+- `mkt ast show`
+- `mkt ast query --backend jsonpath`
+- raw token and inline query support
+- adapter path from simple selectors to JSONPath where possible
+
+Utility:
+
+- debugging parser behavior
+- developing transforms
+- power-user structural extraction
+- migration path for legacy `markitect-main` AST workflows
+
+### UC-RL-002: Local Persistent Knowledge Index
+
+Build a local cache/index for a repo or document collection:
+
+- content-addressed document snapshots
+- SQLite JSON tables for structure
+- SQLite FTS5 for section/block text search
+- optional DuckDB/Arrow export for analytical work
+- incremental refresh based on content hashes
+
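Incremental refresh based on content hashes can be sketched with the standard `sqlite3` and `hashlib` modules. The `snapshots` table and the `refresh_index` helper are hypothetical; the sketch only tracks which documents changed and leaves the actual re-parsing and re-indexing to the caller.

```python
import hashlib
import sqlite3


def refresh_index(conn, documents):
    """Record content hashes and return the paths that need re-indexing.

    `documents` maps a path to its current Markdown text.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS snapshots ("
        "path TEXT PRIMARY KEY, content_hash TEXT NOT NULL)"
    )
    changed = []
    for path, text in sorted(documents.items()):
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        row = conn.execute(
            "SELECT content_hash FROM snapshots WHERE path = ?", (path,)
        ).fetchone()
        if row is not None and row[0] == digest:
            continue  # unchanged since the last refresh, skip the work
        conn.execute(
            "INSERT OR REPLACE INTO snapshots (path, content_hash) VALUES (?, ?)",
            (path, digest),
        )
        changed.append(path)
    conn.commit()
    return changed
```

A first call reports every document as changed; a second call with identical content reports none, and editing a single file reports only that path. Deleted files are not handled here and would need a separate pass.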
+Utility:
+
+- fast repeated queries
+- search across many Markdown files
+- offline/local-first knowledge work
+- foundation for batch transforms and generation pipelines
+
+### UC-RL-003: Agent Working Memory Cache
+
+Create activatable context packages for LLM agents:
+
+- namespace-scoped memories
+- short-term working sets and long-term caches
+- semantic/episodic/procedural memory categories
+- drop/reactivate by stable id
+- token-budget-aware context assembly
+- provenance and freshness metadata
+
+Utility:
+
+- efficient agent work across long projects
+- reusable context packs for recurring tasks
+- controlled memory updates and recall
+- bridge from Markitect documents to agent infrastructure
+
+### UC-RL-004: Access-Controlled Knowledge Gateway
+
+Add policy enforcement to cached retrieval:
+
+- labels/trust zones for local use
+- ACL/ReBAC/ABAC adapters for stricter systems
+- policy-aware query result filtering
+- decision logs and diagnostics
+- secure context packages for LLM use
+
+Utility:
+
+- enterprise and IT-security use cases
+- multi-tenant knowledge bases
+- agent access control
+- auditable data exposure
+
+## Design Principles
+
+- The core remains infrastructure-free.
+- Backends are optional and capability-declared.
+- Every cached object is content-addressed or provenance-addressed.
+- Query adapters return the same match/result envelope.
+- Policy is checked before data leaves a backend boundary.
+- Context packages are explicit, droppable, and reactivatable.
+- LLM memory is data with provenance, not invisible prompt residue.
+- Experimental backends belong behind stable contracts.