coulomb/markitect-tool

tegwick 6f0facd744 Workplan dependencies and prio for text research lab workplans

2026-05-04 00:12:07 +02:00

Research Lab: Sophisticated Cache Backends

Date: 2026-05-03

Purpose

This research note explores how markitect-tool can keep its slim, markdown-native core while allowing sophisticated optional backends for cached ASTs, structured indexes, multiple query paradigms, agent working memory, and access-controlled knowledge systems.

The goal is not to rebuild markitect-main wholesale. The goal is to preserve the useful insight behind it: once Markdown has been parsed into a trustworthy structure, many higher-value operations become possible if that structure can be cached, indexed, queried, reactivated, and governed.

Research Signals

Content Addressing And Reproducibility

Git's object model is a practical reference for content-addressed storage: content is written to an object database and retrieved by a hash-derived key. Bazel's remote cache is split in a similar way, into an action cache that maps action hashes to result metadata and a content-addressable store that holds the output files, so work can be reused when inputs are unchanged.

Relevance:

Parse results should be keyed by content hash, parser version, and options.
Derived indexes should declare their input snapshots and invalidation rules.
Reproducible context packages need stable object identities.
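
The first point above can be sketched as a key-derivation function. This is a minimal illustration, not markitect-tool's actual scheme: the function name `parse_cache_key`, the option names, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
import json


def parse_cache_key(content: bytes, parser_version: str, options: dict) -> str:
    """Derive a stable cache key from content, parser version, and options.

    Hypothetical helper: the name and key layout are assumptions.
    """
    # Hash the raw bytes first so the key never embeds the document itself.
    content_hash = hashlib.sha256(content).hexdigest()
    # Canonical JSON (sorted keys, fixed separators) keeps the key independent
    # of the order in which options were supplied.
    envelope = json.dumps(
        {"content": content_hash, "parser": parser_version, "options": options},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(envelope.encode("utf-8")).hexdigest()


doc = b"# Title\n"
key_a = parse_cache_key(doc, "1.0.0", {"gfm": True, "frontmatter": False})
key_b = parse_cache_key(doc, "1.0.0", {"frontmatter": False, "gfm": True})
key_c = parse_cache_key(doc, "1.0.1", {"gfm": True, "frontmatter": False})
```

Here `key_a` and `key_b` are equal because only the option order differs, while `key_c` differs because the parser version changed, which is the invalidation behavior the list asks for.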

Structured Query And AST Introspection

JSONPath is now standardized as RFC 9535. It defines selection and extraction over JSON values and has security considerations around implementation behavior and query construction. This makes it a good optional backend for power users who need raw access to the full parsed structure.

SQLite JSON and FTS5 provide a pragmatic local storage/query foundation. FTS5 supports full-text search, relevance ranking, phrase/prefix/NEAR queries, and external-content tables. These features map well to Markdown sections and blocks while keeping local-first operation.

Relevance:

Keep the current simple selector API as the common surface.
Add JSONPath over Document.to_dict() as an optional advanced adapter.
Add SQLite as the first local persistent index backend.
Keep AST introspection as a debugging and research-lab capability, not as the default user interface.

Columnar And Vector Backends

Apache Arrow defines a language-independent columnar memory format. DuckDB is strong for local analytical SQL over structured data. Vector stores such as Qdrant and LanceDB, and the pgvector extension for PostgreSQL, provide semantic retrieval primitives.

Relevance:

The core should not depend on any vector database.
Index backends should advertise capabilities: text search, SQL, JSONPath, vector search, hybrid retrieval, analytical scans.
Vector indexes should store provenance back to document, section, and content hash, not just opaque chunks.
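
Capability advertisement could look roughly like the following. The `Capability` flag names mirror the list above; the `IndexBackend` protocol, the two example backends, and `backends_supporting` are hypothetical names introduced only for this sketch.

```python
from enum import Flag, auto
from typing import Protocol


class Capability(Flag):
    TEXT_SEARCH = auto()
    SQL = auto()
    JSONPATH = auto()
    VECTOR_SEARCH = auto()
    HYBRID_RETRIEVAL = auto()
    ANALYTICAL_SCAN = auto()


class IndexBackend(Protocol):
    """Hypothetical contract: a backend only has to name itself and
    declare what it can do."""

    name: str
    capabilities: Capability


class SqliteBackend:
    name = "sqlite"
    capabilities = Capability.TEXT_SEARCH | Capability.SQL


class VectorBackend:
    name = "vector"
    capabilities = Capability.VECTOR_SEARCH | Capability.HYBRID_RETRIEVAL


def backends_supporting(backends, required: Capability) -> list:
    """Return names of backends that advertise every required capability."""
    return [b.name for b in backends if (b.capabilities & required) == required]


registry = [SqliteBackend(), VectorBackend()]
sql_text = backends_supporting(registry, Capability.TEXT_SEARCH | Capability.SQL)
```

Because the core only inspects the declared flags, it never needs to import a vector database to decide whether one is available.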

Agent Context And Working Memory

The Model Context Protocol gives a useful integration model: resources provide context/data, tools execute actions, and roots define filesystem or URI boundaries. LangChain/LangGraph memory docs distinguish short-term, thread-scoped memory from long-term, namespace-scoped memory, and further split long-term memory into semantic, episodic, and procedural forms. The MemGPT paper frames memory management as an operating-system-like problem for LLMs.

Relevance:

Markitect context caches should be namespace-scoped and explicitly activatable.
A context package should carry text, structure, provenance, policy, freshness, and token-budget metadata.
Agents should be able to drop and reactivate working context by stable id.
Memory writes need two modes: hot-path writes made during the agent's turn, and background writes made asynchronously afterward.
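
A context package and its registry might be shaped like this. It is a sketch under assumptions: `ContextPackage`, `ContextRegistry`, and every field and method name are hypothetical, and storage is an in-memory dict rather than a real backend.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ContextPackage:
    package_id: str   # stable id used to reactivate the package
    namespace: str    # scope, for example a project or an agent
    text: str
    structure: dict = field(default_factory=dict)
    provenance: dict = field(default_factory=dict)  # source, section, hash
    policy_labels: tuple = ()
    token_estimate: int = 0
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )  # freshness metadata


class ContextRegistry:
    """Keeps packages stored even while they are dropped from the active set."""

    def __init__(self) -> None:
        self._stored = {}
        self._active = set()

    def put(self, package: ContextPackage) -> None:
        self._stored[(package.namespace, package.package_id)] = package

    def activate(self, namespace: str, package_id: str) -> ContextPackage:
        key = (namespace, package_id)
        package = self._stored[key]  # raises KeyError for unknown ids
        self._active.add(key)
        return package

    def drop(self, namespace: str, package_id: str) -> None:
        # Dropping only deactivates; the stored package stays reactivatable.
        self._active.discard((namespace, package_id))

    def active_ids(self, namespace: str) -> list:
        return sorted(pid for ns, pid in self._active if ns == namespace)


contexts = ContextRegistry()
contexts.put(ContextPackage("ctx-001", "lab", "Purpose section text"))
contexts.activate("lab", "ctx-001")
contexts.drop("lab", "ctx-001")
reactivated = contexts.activate("lab", "ctx-001")
```

Separating "stored" from "active" is what makes drop and reactivate cheap: nothing is recomputed when an agent picks a package back up by id.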

Provenance, Observability, And Debuggability

W3C PROV provides a vocabulary for entities, activities, agents, and derivations. OpenTelemetry traces provide spans and attributes for observing distributed or multi-step operations.

Relevance:

Cache entries should record what produced them.
Query results should be explainable: source file, section, content hash, index backend, policy decision, and transform chain.
Agent context packs should be auditable.
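
One way to make a query result explainable is to carry the listed fields alongside the matched text. The class and field names below (`ResultProvenance`, `QueryResult`, `explain`) are hypothetical, and the record is far simpler than a full W3C PROV graph or an OpenTelemetry span.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResultProvenance:
    source_file: str
    section: str
    content_hash: str
    index_backend: str
    policy_decision: str
    transform_chain: tuple  # ordered names of the steps that produced the text


@dataclass(frozen=True)
class QueryResult:
    text: str
    provenance: ResultProvenance

    def explain(self) -> str:
        """Render a one-line, human-readable account of where the text came from."""
        p = self.provenance
        chain = " -> ".join(p.transform_chain) or "(none)"
        return (
            f"{p.source_file}#{p.section} [{p.content_hash[:12]}] "
            f"via {p.index_backend}; policy={p.policy_decision}; "
            f"transforms={chain}"
        )


result = QueryResult(
    text="This research note explores ...",
    provenance=ResultProvenance(
        source_file="docs/cache.md",
        section="Purpose",
        content_hash="ab" * 32,
        index_backend="sqlite-fts5",
        policy_decision="allow",
        transform_chain=("parse", "extract-section"),
    ),
)
summary = result.explain()
```

The same record can be serialized into an audit log, so an agent context pack is traceable back to the files and decisions behind it.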

Access Control: Fluid To Rigid

Google's Zanzibar demonstrates relationship-based authorization (ReBAC) at large scale. OpenFGA and SpiceDB offer Zanzibar-style ReBAC as productized systems. OPA/Rego and Cedar provide policy evaluation models for attribute-based and rule-based decisions.

Relevance:

Markitect should support a fluid-to-rigid access-control ladder.
Local labs can start with labels and trust scopes.
Secure deployments need policy checks before query results are returned to agents or users.
Policy decisions should be part of the diagnostic and provenance trail.
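
The fluid end of that ladder, labels and trust scopes, could be sketched as below. `Decision` and `filter_by_labels` are hypothetical names; a rigid deployment would presumably delegate the check to an external policy engine instead, which is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    item_id: str
    allowed: bool
    reason: str


def filter_by_labels(items, caller_scopes):
    """Filter (item_id, labels) pairs against the caller's trust scopes.

    Returns the allowed ids plus a decision log covering every item, so that
    denials are as visible in diagnostics as grants.
    """
    scopes = set(caller_scopes)
    allowed, log = [], []
    for item_id, labels in items:
        missing = set(labels) - scopes
        if missing:
            reason = "missing scopes: " + ", ".join(sorted(missing))
            log.append(Decision(item_id, False, reason))
        else:
            allowed.append(item_id)
            log.append(Decision(item_id, True, "all labels covered"))
    return allowed, log


sections = [
    ("intro", {"public"}),
    ("roadmap", {"internal"}),
    ("incident", {"internal", "security"}),
]
visible, decisions = filter_by_labels(sections, {"public", "internal"})
```

The check runs before anything is returned, and the log is a return value rather than a side effect, which makes it easy to attach to the provenance trail.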

Main Finding

The optional backend should be a capability-oriented cache fabric, not a single database choice.

The slim core should continue to parse, validate, query, transform, and generate Markdown without persistent infrastructure. The research-lab backend should attach through explicit interfaces:

content-addressed snapshots
index manifests
query adapter registry
memory/context package registry
access policy gateway
provenance and trace records

That lets the project support spontaneous one-time tool use and also grow into high-performance, agentic, security-sensitive knowledge systems.

Most Promising Use Cases

UC-RL-001: AST Introspection And JSONPath Backend

Expose raw parsed documents for advanced users:

mkt ast show
mkt ast query --backend jsonpath
raw token and inline query support
adapter path from simple selectors to JSONPath where possible
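
The adapter path in the last item might start from something like this translation. It assumes, hypothetically, that nodes in `Document.to_dict()` carry a `type` member plus plain attribute members such as `level`; the real node shape and the real selector syntax may differ. The emitted query uses the descendant segment and filter selector defined in RFC 9535.

```python
import json
import re

_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def selector_to_jsonpath(node_type: str, **attrs) -> str:
    """Translate a type-plus-attributes selector into a JSONPath filter query.

    Hypothetical helper: the "type" member name is an assumption about the
    parsed document structure.
    """
    conditions = [("type", node_type), *sorted(attrs.items())]
    clauses = []
    for name, value in conditions:
        # Only plain member names are emitted in shorthand form; anything
        # else is rejected rather than quoted, to keep query construction safe.
        if not _NAME.fullmatch(name):
            raise ValueError(f"unsupported member name: {name!r}")
        # json.dumps yields valid JSONPath literals for strings, numbers,
        # booleans, and null.
        clauses.append(f"@.{name} == {json.dumps(value)}")
    return "$..[?" + " && ".join(clauses) + "]"


query = selector_to_jsonpath("heading", level=2)
# query is: $..[?@.type == "heading" && @.level == 2]
```

Selectors that cannot be expressed this way would stay on the simple selector API, which matches the "where possible" qualifier above.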

Utility:

debugging parser behavior
developing transforms
power-user structural extraction
migration path for legacy markitect-main AST workflows

UC-RL-002: Local Persistent Knowledge Index

Build a local cache/index for a repo or document collection:

content-addressed document snapshots
SQLite JSON tables for structure
SQLite FTS5 for section/block text search
optional DuckDB/Arrow export for analytical work
incremental refresh based on content hashes
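
A minimal version of this index fits in the standard library, assuming the local SQLite build includes FTS5 (most Python distributions do, but it is not guaranteed). The table names, column names, and the `open_index`, `refresh`, and `search` functions are hypothetical.

```python
import hashlib
import sqlite3


def open_index(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    # One row per document records the hash of the last indexed content.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS snapshots ("
        "doc_path TEXT PRIMARY KEY, content_hash TEXT NOT NULL)"
    )
    # One FTS5 row per section; doc_path is stored but not tokenized.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS sections "
        "USING fts5(doc_path UNINDEXED, heading, body)"
    )
    return conn


def refresh(conn, doc_path: str, raw: bytes, sections) -> bool:
    """Re-index a document only when its content hash changed.

    `sections` is a list of (heading, body) pairs. Returns True if the
    document was re-indexed and False if the cached index was reused.
    """
    digest = hashlib.sha256(raw).hexdigest()
    row = conn.execute(
        "SELECT content_hash FROM snapshots WHERE doc_path = ?", (doc_path,)
    ).fetchone()
    if row is not None and row[0] == digest:
        return False
    conn.execute("DELETE FROM sections WHERE doc_path = ?", (doc_path,))
    conn.executemany(
        "INSERT INTO sections (doc_path, heading, body) VALUES (?, ?, ?)",
        [(doc_path, heading, body) for heading, body in sections],
    )
    conn.execute(
        "INSERT OR REPLACE INTO snapshots (doc_path, content_hash) VALUES (?, ?)",
        (doc_path, digest),
    )
    conn.commit()
    return True


def search(conn, query: str) -> list:
    """Full-text search over sections, best match first."""
    return conn.execute(
        "SELECT doc_path, heading FROM sections "
        "WHERE sections MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()


index = open_index()
raw = b"# Purpose\nKeep the core slim.\n# Caching\nKey parse results by content hash.\n"
parsed = [
    ("Purpose", "Keep the core slim."),
    ("Caching", "Key parse results by content hash."),
]
first = refresh(index, "notes.md", raw, parsed)
second = refresh(index, "notes.md", raw, parsed)
hits = search(index, "hash")
```

The second `refresh` call is a no-op because the hash is unchanged, which is the incremental behavior described above. A DuckDB or Arrow export would read from the same tables and is left out of the sketch.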

Utility:

fast repeated queries
search across many Markdown files
offline/local-first knowledge work
foundation for batch transforms and generation pipelines

UC-RL-003: Agent Working Memory Cache

Create activatable context packages for LLM agents:

namespace-scoped memories
short-term working sets and long-term caches
semantic/episodic/procedural memory categories
drop/reactivate by stable id
token-budget-aware context assembly
provenance and freshness metadata
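
Token-budget-aware assembly can be approximated with a greedy pass over candidates ordered by priority. Everything here is an assumption for illustration: the `(package_id, priority, text)` tuple shape, the function names, and especially the whitespace-based token estimate, which stands in for a real tokenizer.

```python
def estimate_tokens(text: str) -> int:
    # Crude stand-in: counts whitespace-separated words, not model tokens.
    return len(text.split())


def assemble_context(candidates, budget: int):
    """Pick packages in descending priority until the budget is exhausted.

    A candidate that does not fit is skipped rather than truncated, so a
    smaller, lower-priority package can still use the remaining budget.
    """
    chosen, used = [], 0
    for package_id, _priority, text in sorted(candidates, key=lambda c: -c[1]):
        cost = estimate_tokens(text)
        if used + cost <= budget:
            chosen.append(package_id)
            used += cost
    return chosen, used


candidates = [
    ("purpose", 3, "keep core slim"),
    ("history", 2, "long background section text"),
    ("glossary", 1, "terms"),
]
selected, tokens_used = assemble_context(candidates, budget=4)
```

With a budget of 4, the sketch selects `purpose` and `glossary` and skips `history`. Greedy selection is not optimal in general, but it is predictable and easy to explain in a provenance trail.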

Utility:

efficient agent work across long projects
reusable context packs for recurring tasks
controlled memory updates and recall
bridge from Markitect documents to agent infrastructure

UC-RL-004: Access-Controlled Knowledge Gateway

Add policy enforcement to cached retrieval:

labels/trust zones for local use
ACL/ReBAC/ABAC adapters for stricter systems
policy-aware query result filtering
decision logs and diagnostics
secure context packages for LLM use

Utility:

enterprise and IT-security use cases
multi-tenant knowledge bases
agent access control
auditable data exposure

Design Principles

The core remains infrastructure-free.
Backends are optional and capability-declared.
Every cached object is content-addressed or provenance-addressed.
Query adapters return the same match/result envelope.
Policy is checked before data leaves a backend boundary.
Context packages are explicit, droppable, and reactivatable.
LLM memory is data with provenance, not invisible prompt residue.
Experimental backends belong behind stable contracts.
