Workplan dependencies and prio for text research lab workplans

2026-05-04 00:12:07 +02:00
parent 4fc891c076
commit 6f0facd744
18 changed files with 1644 additions and 1 deletions


@@ -0,0 +1,259 @@
# Cache Backend Architecture Blueprint
Date: 2026-05-03
## Purpose
This blueprint defines an optional backend architecture for sophisticated
knowledge systems built on top of `markitect-tool`.
It is a research-lab architecture: powerful enough to support cached ASTs,
advanced query backends, agent memory, and access control, but separated from
the slim core so one-off CLI use stays fast and simple.
## Architectural Boundary
The core package owns:
- Markdown parsing
- document contracts
- simple selectors
- deterministic transforms and generation primitives
- unified diagnostics
The optional backend fabric owns:
- persistent snapshots
- indexes
- advanced query adapters
- memory/context packages
- policy enforcement
- provenance records
- trace and performance metadata
The core must be able to run without the backend fabric.
## Conceptual Layers
```text
Markdown files
  -> Core parser and contract layer
  -> Content-addressed document snapshots
  -> Index fabric
       -> AST/JSON index
       -> full-text index
       -> vector/semantic index
       -> analytical/index export
  -> Query adapter registry
       -> simple selectors
       -> JSONPath
       -> SQL/FTS
       -> vector/hybrid retrieval
  -> Context package registry
       -> activated working sets
       -> memory namespaces
       -> agent-ready context bundles
  -> Access policy gateway
       -> labels/ACL/ReBAC/ABAC
       -> result filtering and denial diagnostics
  -> Provenance and observability
```
## Core Interfaces
### Snapshot Backend
Responsible for durable parsed-document snapshots.
Minimum protocol:
```text
put_document(source_path, content, parse_options) -> snapshot_id
get_snapshot(snapshot_id) -> DocumentSnapshot
resolve_source(source_path) -> latest snapshot_id
diff_snapshot(old_id, new_id) -> SnapshotDiff
```
Snapshot identity should include:
- source content hash
- parser version
- parse options
- contract version when relevant
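A minimal sketch of how such an identity could be derived, assuming SHA-256 and canonical JSON as the serialization. The function name `snapshot_id` and the field names inside `identity` are illustrative assumptions, not part of the protocol above.

```python
import hashlib
import json
from typing import Optional


def snapshot_id(content: bytes, parser_version: str, parse_options: dict,
                contract_version: Optional[str] = None) -> str:
    """Derive a deterministic id from everything that can change the parse result."""
    identity = {
        "content_sha256": hashlib.sha256(content).hexdigest(),
        "parser_version": parser_version,
        "parse_options": parse_options,
        "contract_version": contract_version,  # None when no contract applies
    }
    # sort_keys makes the id independent of dict insertion order.
    canonical = json.dumps(identity, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

With this shape, the same content parsed with a new parser version or different options yields a new snapshot id, so stale parse results are never reused by accident.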
### Index Backend
Responsible for derived lookup structures.
Minimum protocol:
```text
capabilities() -> IndexCapabilities
build(snapshot_ids, options) -> IndexBuildResult
refresh(changed_snapshots) -> IndexBuildResult
query(request) -> QueryResult
explain(request) -> QueryPlan
```
The capability vocabulary should include at least:
- `jsonpath`
- `sql`
- `fts`
- `vector`
- `hybrid`
- `inline_tokens`
- `section_graph`
- `policy_pushdown`
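One way capability declarations could be used for backend selection is sketched below. `IndexCapabilities` is named in the protocol above, but its fields, the `supports` method, and the `pick_backend` helper are assumptions made for illustration.

```python
from dataclasses import dataclass

KNOWN_CAPABILITIES = frozenset({
    "jsonpath", "sql", "fts", "vector", "hybrid",
    "inline_tokens", "section_graph", "policy_pushdown",
})


@dataclass(frozen=True)
class IndexCapabilities:
    names: frozenset

    def __post_init__(self):
        # Reject typos early so a backend cannot advertise an unknown capability.
        unknown = self.names - KNOWN_CAPABILITIES
        if unknown:
            raise ValueError(f"unknown capabilities: {sorted(unknown)}")

    def supports(self, *required: str) -> bool:
        return set(required) <= self.names


def pick_backend(backends: dict, *required: str):
    """Return the name of the first backend advertising every required capability."""
    for name, caps in backends.items():
        if caps.supports(*required):
            return name
    return None


backends = {
    "live-files": IndexCapabilities(frozenset()),
    "local-sqlite-cache": IndexCapabilities(frozenset({"sql", "fts"})),
}
```

A caller would ask for `pick_backend(backends, "fts")` and fall back to a diagnostic when the result is `None`, rather than guessing which backend can serve the query.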
### Query Adapter
Translates a stable Markitect query request into backend-specific execution.
Minimum protocol:
```text
name
supports(selector_or_query, target) -> bool
execute(document_or_backend, request) -> QueryResult
explain(request) -> QueryExplanation
```
Adapters must return a common result envelope:
- kind
- path
- value
- text
- source location
- snapshot id
- provenance
- policy decision
- backend metadata
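The envelope could be modelled as a plain dataclass. This is a sketch only: the field names follow the list above, while the types and defaults are assumptions.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class QueryResult:
    kind: str                                # e.g. "heading", "section", "block"
    path: str                                # selector or structural path of the match
    value: Any                               # structured value of the match
    text: str                                # plain-text rendering of the match
    source_location: Optional[dict] = None   # e.g. {"file": ..., "line": ...}
    snapshot_id: Optional[str] = None        # None when querying live files
    provenance: dict = field(default_factory=dict)
    policy_decision: Optional[str] = None    # e.g. "allow", "redacted"
    backend_metadata: dict = field(default_factory=dict)


live = QueryResult(kind="heading", path="headings[0]",
                   value={"level": 2, "text": "Decision"}, text="Decision")
cached = QueryResult(kind="heading", path="headings[0]",
                     value={"level": 2, "text": "Decision"}, text="Decision",
                     snapshot_id="abc123", policy_decision="allow",
                     backend_metadata={"backend": "local-sqlite-cache"})
```

Both results serialize to the same set of keys via `asdict`, which is the property that lets callers stay agnostic about which adapter produced a match.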
### Context Package Registry
Responsible for agent-ready working memory.
Minimum protocol:
```text
create_package(query_or_manifest, budget, policy) -> context_package_id
activate(package_id, thread_or_workspace) -> activation_id
deactivate(activation_id)
refresh(package_id) -> package_id
explain(package_id) -> ContextPackageReport
```
Context packages should include:
- included source spans
- summary layers
- token estimates
- provenance
- freshness
- policy labels
- retrieval recipe
- cache keys
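As an illustration of how a budget might shape a package, the sketch below greedily includes source spans until a token budget is reached. `SourceSpan`, `estimate_tokens`, and `assemble_package` are hypothetical names, and the four-characters-per-token estimate is a crude assumption rather than a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class SourceSpan:
    source_path: str
    start_line: int
    end_line: int
    text: str


def estimate_tokens(text: str) -> int:
    # Crude heuristic (assumption): roughly four characters per token.
    return max(1, len(text) // 4)


def assemble_package(spans, budget: int) -> dict:
    """Include spans in order, skipping any span that would exceed the budget."""
    included, used = [], 0
    for span in spans:
        cost = estimate_tokens(span.text)
        if used + cost > budget:
            continue  # this span does not fit; later, smaller spans still might
        included.append(span)
        used += cost
    return {"spans": included, "token_estimate": used, "budget": budget}


spans = [
    SourceSpan("docs/a.md", 1, 4, "a" * 40),     # about 10 tokens
    SourceSpan("docs/b.md", 10, 90, "b" * 400),  # about 100 tokens
    SourceSpan("docs/c.md", 3, 5, "c" * 20),     # about 5 tokens
]
package = assemble_package(spans, budget=20)
```

Recording `token_estimate` and the included spans in the package makes the later `explain(package_id)` report straightforward to produce.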
### Access Policy Gateway
Responsible for authorization and redaction before results leave a backend.
Minimum protocol:
```text
authorize(subject, action, object, context) -> PolicyDecision
filter_results(subject, action, results, context) -> FilteredResults
explain_decision(decision_id) -> PolicyExplanation
```
Policy should support a ladder of increasingly rigid mechanisms:
1. Labels and trust zones.
2. File/path ACLs.
3. Relationship-based access control.
4. Attribute/rule-based policies.
5. External authorization services.
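The first rung of the ladder (labels) could look roughly like the sketch below. The rule "a subject must hold every label attached to an object" is an assumption chosen for illustration, and the `context` argument from the protocol is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyDecision:
    allowed: bool
    reason: str


def authorize(subject_labels, action, object_labels) -> PolicyDecision:
    """Label rung only: `action` is accepted for protocol shape but not evaluated."""
    missing = set(object_labels) - set(subject_labels)
    if missing:
        return PolicyDecision(False, f"missing labels: {sorted(missing)}")
    return PolicyDecision(True, "all labels satisfied")


def filter_results(subject_labels, action, results) -> dict:
    """Drop denied results but report that filtering happened and why."""
    kept, denials = [], []
    for result in results:
        decision = authorize(subject_labels, action, result["labels"])
        if decision.allowed:
            kept.append(result)
        else:
            denials.append(decision.reason)
    return {"results": kept, "denied_count": len(denials), "denials": denials}


results = [
    {"path": "docs/a.md", "labels": ["public"]},
    {"path": "docs/b.md", "labels": ["internal", "security"]},
]
filtered = filter_results({"public", "internal"}, "read", results)
```

Returning the denial count alongside the kept results keeps filtering visible to the caller instead of silently shrinking the result set.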
## Suggested Backend Manifest
Backends should register through a Markdown/YAML manifest:
````markdown
# Local SQLite Backend
```yaml markitect-backend
id: local-sqlite-cache
kind: cache-backend
capabilities:
  - snapshots
  - json
  - fts
  - sql
  - provenance
storage:
  engine: sqlite
  path: .markitect/cache/index.sqlite
policy:
  mode: labels
```
````
## CLI Direction
The first backend CLI should be explicit:
```text
mkt cache init
mkt cache build <path>
mkt cache status
mkt cache query <selector-or-query> --backend <name>
mkt ast show <file>
mkt ast query <file> <jsonpath>
mkt context pack <manifest-or-query>
mkt context activate <package-id>
mkt policy check <subject> <action> <object>
```
Do not hide persistence behind `mkt query`. The user should know when the tool
is querying live files versus a persistent backend.
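A possible shape for the explicit `mkt cache` command group, sketched with `argparse` from the standard library. This is not the project's actual CLI wiring; treating `--backend` as required is an assumption read from the usage line above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="mkt")
    groups = parser.add_subparsers(dest="group", required=True)

    # Persistent-backend commands live under their own explicit group.
    cache = groups.add_parser("cache", help="persistent backend commands")
    commands = cache.add_subparsers(dest="command", required=True)
    commands.add_parser("init")
    build = commands.add_parser("build")
    build.add_argument("path")
    commands.add_parser("status")
    query = commands.add_parser("query")
    query.add_argument("query")
    query.add_argument("--backend", required=True)  # assumption: always explicit
    return parser


args = build_parser().parse_args(
    ["cache", "query", "sections[heading=Context]", "--backend", "local-sqlite-cache"]
)
```

Because the backend name is part of the invocation, a user reading shell history can tell whether a result came from live files or from a persistent cache.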
## Recommended First Stack
Start with:
- content hashes computed with the Python standard library (`hashlib`)
- SQLite for snapshot metadata, JSON, and FTS5
- JSONPath as an optional extra
- local filesystem cache directory
- simple label policy
- provenance tables
Defer:
- vector search until text/structure cache works
- external authorization engines until local policy model is stable
- MCP server exposure until resources/tools are secure and explainable
- distributed cache until local invalidation is boring
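The SQLite part of the first stack could be tried out with nothing beyond the standard-library `sqlite3` module, provided the linked SQLite build includes FTS5. The table name `sections` and its columns are assumptions for this sketch.

```python
import sqlite3


def build_index(rows) -> sqlite3.Connection:
    """Create an in-memory FTS5 table over (source_path, heading, body) rows."""
    conn = sqlite3.connect(":memory:")
    # source_path is stored but not tokenized, so path fragments never match queries.
    conn.execute(
        "CREATE VIRTUAL TABLE sections USING fts5(source_path UNINDEXED, heading, body)"
    )
    conn.executemany(
        "INSERT INTO sections (source_path, heading, body) VALUES (?, ?, ?)", rows
    )
    return conn


def search(conn: sqlite3.Connection, query: str):
    """Run a full-text query, best matches first."""
    return conn.execute(
        "SELECT source_path, heading FROM sections WHERE sections MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()


conn = build_index([
    ("docs/a.md", "Decision", "Cache invalidation must be explicit."),
    ("docs/b.md", "Context", "The parser is deterministic."),
])
hits = search(conn, "invalidation")
```

A file-backed database at the path named in the backend manifest would replace `:memory:` in a persistent setup.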
## Security Notes
Cached data becomes a new data exposure surface.
Minimum requirements before secure use:
- cache location is explicit
- cache entries know source path and content hash
- policy mode is visible
- query results report policy filtering
- context packages list what they include
- destructive cache operations require explicit command
- no backend silently sends document content to a network service
## Architecture Decision
Implement the backend fabric after deterministic transform/composition
primitives are underway, but before serious caching, agent memory, or advanced
query backends. This lets WP-0003 continue while reserving a clean path for the
research-lab track.

docs/query-extraction.md Normal file

@@ -0,0 +1,76 @@
# Query And Extraction
Date: 2026-05-03
## Purpose
The first query layer keeps selection close to the structured Markdown model.
It is intentionally small and deterministic. JSONPath or another query backend
can be added later behind the same API if the simple selector language becomes
too limited.
## CLI
```text
mkt query <document.md> <selector> [--format json|yaml|text]
mkt extract <document.md> <selector> [--format text|json|yaml]
```
`query` returns structured matches. `extract` returns textual content from the
matches.
## Selectors
Supported targets:
- `document`, `$`, or `.`: full parsed document
- `frontmatter`: YAML frontmatter
- `headings`: heading objects
- `sections`: heading-led sections
- `blocks`: parsed content blocks
- `metrics`: document and section metrics
Supported path examples:
```text
frontmatter.status
frontmatter.owner.name
metrics.document.words
metrics.document.sections
```
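Dotted paths like these can be resolved with a simple walk over nested dictionaries. The sketch below assumes the parsed document is available as a plain `dict`; `resolve_path` is a hypothetical helper, not the tool's actual API.

```python
def resolve_path(data: dict, path: str):
    """Follow a dotted path through nested dicts; return None if any key is missing."""
    current = data
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


document = {
    "frontmatter": {"status": "draft", "owner": {"name": "Ada"}},
    "metrics": {"document": {"words": 120, "sections": 4}},
}
```

For example, `resolve_path(document, "frontmatter.owner.name")` walks three keys, while a path with a missing segment yields `None` instead of raising.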
Supported filters:
```text
headings[level=2]
headings[text=Decision]
headings[text~=decision]
sections[heading=Context]
sections[heading~=risk]
sections[contains=problem]
sections[contains~=PROBLEM]
blocks[type=paragraph]
blocks[contains~=follow-up]
```
`=` is an exact, case-sensitive match. `~=` is a case-insensitive substring
match.
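The two operators can be illustrated with a small matcher over field-based filters. This is a sketch under assumptions: the regular expression, the function names, and the dict-shaped document are illustrative, and the `contains` filters are deliberately not covered.

```python
import re

# target[field=value] or target[field~=value]
FILTER_RE = re.compile(r"^(?P<target>\w+)\[(?P<field>\w+)(?P<op>~=|=)(?P<value>.+)\]$")


def parse_selector(selector: str):
    match = FILTER_RE.match(selector)
    if match is None:
        raise ValueError(f"unsupported selector: {selector!r}")
    return match.group("target"), match.group("field"), match.group("op"), match.group("value")


def matches(item: dict, field: str, op: str, value: str) -> bool:
    actual = str(item.get(field, ""))
    if op == "=":
        return actual == value               # exact, case-sensitive
    return value.lower() in actual.lower()   # substring, case-insensitive


def select(document: dict, selector: str):
    target, field, op, value = parse_selector(selector)
    return [item for item in document.get(target, []) if matches(item, field, op, value)]


document = {"headings": [
    {"level": 1, "text": "Title"},
    {"level": 2, "text": "Decision"},
    {"level": 2, "text": "Open decisions"},
]}
```

Here `headings[text=Decision]` selects only the heading whose text is exactly `Decision`, while `headings[text~=decision]` also picks up `Open decisions`.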
## Current Boundary
This is not a full query language. It covers practical extraction from the
current parser model:
- frontmatter values
- headings
- sections
- content blocks
- metrics
Future query backend work should preserve this simple surface and add optional
adapters rather than forcing every user into a heavier language.
Advanced query and cache backends are tracked in:
- `docs/cache-backend-architecture-blueprint.md`
- `workplans/MKTT-WP-0007-advanced-query-and-local-index-backend.md`


@@ -0,0 +1,248 @@
# Research Lab: Sophisticated Cache Backends
Date: 2026-05-03
## Purpose
This research note explores how `markitect-tool` can keep its slim,
markdown-native core while allowing sophisticated optional backends for cached
ASTs, structured indexes, multiple query paradigms, agent working memory, and
access-controlled knowledge systems.
The goal is not to rebuild `markitect-main` wholesale. The goal is to preserve
the useful insight behind it: once Markdown has been parsed into a trustworthy
structure, many higher-value operations become possible if that structure can
be cached, indexed, queried, reactivated, and governed.
## Research Signals
### Content Addressing And Reproducibility
Git's object model is a practical reference for content-addressed storage:
content is written to an object database and retrieved by a hash-derived key.
Bazel remote caching similarly splits its cache into an action cache (metadata
about action results) and a content-addressable store of output files, so work
can be reused when inputs are unchanged.
Relevance:
- Parse results should be keyed by content hash, parser version, and options.
- Derived indexes should declare their input snapshots and invalidation rules.
- Reproducible context packages need stable object identities.
Sources:
- https://git-scm.com/book/en/v2/Git-Internals-Git-Objects
- https://docs.bazel.build/versions/main/remote-caching.html
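To make the Git reference concrete: in a SHA-1 repository, a blob's object id is the SHA-1 of a short header followed by the file content. The helper name `git_blob_id` is an assumption for this sketch.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the object id Git assigns to a blob in a SHA-1 repository."""
    # Header format: the object type, a space, the content length, then a NUL byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


empty_id = git_blob_id(b"")
```

The same idea carries over to parse results: hash a canonical header plus the inputs, and identical inputs always resolve to the same stored object.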
### Structured Query And AST Introspection
JSONPath is now standardized as RFC 9535. It defines selection and extraction
over JSON values and has security considerations around implementation behavior
and query construction. This makes it a good optional backend for power users
who need raw access to the full parsed structure.
SQLite JSON and FTS5 provide a pragmatic local storage/query foundation. FTS5
supports full-text search, relevance ranking, phrase/prefix/NEAR queries, and
external-content tables. These features map well to Markdown sections and
blocks while keeping local-first operation.
Relevance:
- Keep the current simple selector API as the common surface.
- Add JSONPath over `Document.to_dict()` as an optional advanced adapter.
- Add SQLite as the first local persistent index backend.
- Keep AST introspection as a debugging and research-lab capability, not as
the default user interface.
Sources:
- https://www.rfc-editor.org/rfc/rfc9535.html
- https://www.sqlite.org/json1.html
- https://www.sqlite.org/fts5.html
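As a small illustration of the SQLite JSON route, the sketch below stores a parsed document as JSON text and extracts values with `json_extract`. SQLite's path syntax is its own JSONPath-like subset, not RFC 9535, and the table layout and document shape are assumptions.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (snapshot_id TEXT PRIMARY KEY, doc TEXT)")

parsed = {
    "frontmatter": {"status": "draft"},
    "headings": [{"level": 2, "text": "Decision"}],
}
conn.execute("INSERT INTO documents VALUES (?, ?)", ("snap-1", json.dumps(parsed)))


def extract(path: str):
    """Return the value at a SQLite JSON path for the stored snapshot."""
    row = conn.execute(
        "SELECT json_extract(doc, ?) FROM documents WHERE snapshot_id = ?",
        (path, "snap-1"),
    ).fetchone()
    return row[0]
```

A full RFC 9535 adapter would need a dedicated JSONPath library as an optional extra; SQLite's built-in paths cover only simple key and index access.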
### Columnar And Vector Backends
Apache Arrow defines a language-independent columnar memory format. DuckDB is
strong for local analytical SQL over structured data. Vector databases such as
Qdrant, LanceDB, and pgvector provide semantic retrieval primitives.
Relevance:
- The core should not depend on any vector database.
- Index backends should advertise capabilities: text search, SQL, JSONPath,
vector search, hybrid retrieval, analytical scans.
- Vector indexes should store provenance back to document, section, and content
hash, not just opaque chunks.
Sources:
- https://arrow.apache.org/docs/format/Columnar.html
- https://duckdb.org/docs/stable/data/json/overview
- https://qdrant.tech/documentation/manage-data/collections/
- https://docs.lancedb.com/
- https://github.com/pgvector/pgvector
### Agent Context And Working Memory
The Model Context Protocol gives a useful integration model: resources provide
context/data, tools execute actions, and roots define filesystem or URI
boundaries. LangChain/LangGraph memory docs distinguish short-term,
thread-scoped memory from long-term, namespace-scoped memory, and further split
long-term memory into semantic, episodic, and procedural forms. The MemGPT
paper frames memory management as an operating-system-like problem for LLMs.
Relevance:
- Markitect context caches should be namespace-scoped and explicitly
activatable.
- A context package should carry text, structure, provenance, policy, freshness,
and token-budget metadata.
- Agents should be able to drop and reactivate working context by stable id.
- Memory writes need hot-path and background modes.
Sources:
- https://modelcontextprotocol.io/specification/2025-06-18
- https://docs.langchain.com/oss/python/concepts/memory
- https://developers.llamaindex.ai/python/framework/module_guides/deploying/agents/memory/
- https://arxiv.org/abs/2310.08560
### Provenance, Observability, And Debuggability
W3C PROV provides a vocabulary for entities, activities, agents, and
derivations. OpenTelemetry traces provide spans and attributes for observing
distributed or multi-step operations.
Relevance:
- Cache entries should record what produced them.
- Query results should be explainable: source file, section, content hash,
index backend, policy decision, and transform chain.
- Agent context packs should be auditable.
Sources:
- https://www.w3.org/TR/prov-overview/
- https://opentelemetry.io/docs/concepts/signals/traces/
### Access Control: Fluid To Rigid
Zanzibar demonstrates a relationship-based authorization model at large scale.
OpenFGA and SpiceDB make Zanzibar-style relationship-based access control
available as productized systems. OPA/Rego and Cedar provide policy evaluation
models for attribute and rule-based decisions.
Relevance:
- Markitect should support a fluid-to-rigid access-control ladder.
- Local labs can start with labels and trust scopes.
- Secure deployments need policy checks before query results are returned to
agents or users.
- Policy decisions should be part of the diagnostic and provenance trail.
Sources:
- https://www.usenix.org/conference/atc19/presentation/pang
- https://openfga.dev/docs/concepts
- https://www.openpolicyagent.org/docs/policy-language
- https://docs.cedarpolicy.com/
## Main Finding
The optional backend should be a **capability-oriented cache fabric**, not a
single database choice.
The slim core should continue to parse, validate, query, transform, and
generate Markdown without persistent infrastructure. The research-lab backend
should attach through explicit interfaces:
- content-addressed snapshots
- index manifests
- query adapter registry
- memory/context package registry
- access policy gateway
- provenance and trace records
That lets the project support spontaneous one-time tool use and also grow into
high-performance, agentic, security-sensitive knowledge systems.
## Most Promising Use Cases
### UC-RL-001: AST Introspection And JSONPath Backend
Expose raw parsed documents for advanced users:
- `mkt ast show`
- `mkt ast query --backend jsonpath`
- raw token and inline query support
- adapter path from simple selectors to JSONPath where possible
Utility:
- debugging parser behavior
- developing transforms
- power-user structural extraction
- migration path for legacy `markitect-main` AST workflows
### UC-RL-002: Local Persistent Knowledge Index
Build a local cache/index for a repo or document collection:
- content-addressed document snapshots
- SQLite JSON tables for structure
- SQLite FTS5 for section/block text search
- optional DuckDB/Arrow export for analytical work
- incremental refresh based on content hashes
Utility:
- fast repeated queries
- search across many Markdown files
- offline/local-first knowledge work
- foundation for batch transforms and generation pipelines
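Incremental refresh based on content hashes can be sketched as a comparison between the previously recorded hashes and the current file contents. `plan_refresh` and the returned keys are hypothetical names for illustration.

```python
import hashlib


def content_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def plan_refresh(previous: dict, current_files: dict) -> dict:
    """Compare stored hashes (path -> hash) with current contents (path -> bytes)."""
    current = {path: content_hash(data) for path, data in current_files.items()}
    return {
        "added": sorted(p for p in current if p not in previous),
        "changed": sorted(p for p in current if p in previous and previous[p] != current[p]),
        "removed": sorted(p for p in previous if p not in current),
        "hashes": current,  # becomes `previous` for the next run
    }


previous = {
    "a.md": content_hash(b"one"),
    "b.md": content_hash(b"two"),
    "d.md": content_hash(b"gone"),
}
plan = plan_refresh(previous, {"a.md": b"one", "b.md": b"changed", "c.md": b"new"})
```

Only the paths listed under `added` and `changed` need to be re-parsed and re-indexed; unchanged files keep their existing snapshots.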
### UC-RL-003: Agent Working Memory Cache
Create activatable context packages for LLM agents:
- namespace-scoped memories
- short-term working sets and long-term caches
- semantic/episodic/procedural memory categories
- drop/reactivate by stable id
- token-budget-aware context assembly
- provenance and freshness metadata
Utility:
- efficient agent work across long projects
- reusable context packs for recurring tasks
- controlled memory updates and recall
- bridge from Markitect documents to agent infrastructure
### UC-RL-004: Access-Controlled Knowledge Gateway
Add policy enforcement to cached retrieval:
- labels/trust zones for local use
- ACL/ReBAC/ABAC adapters for stricter systems
- policy-aware query result filtering
- decision logs and diagnostics
- secure context packages for LLM use
Utility:
- enterprise and IT-security use cases
- multi-tenant knowledge bases
- agent access control
- auditable data exposure
## Design Principles
- The core remains infrastructure-free.
- Backends are optional and capability-declared.
- Every cached object is content-addressed or provenance-addressed.
- Query adapters return the same match/result envelope.
- Policy is checked before data leaves a backend boundary.
- Context packages are explicit, droppable, and reactivatable.
- LLM memory is data with provenance, not invisible prompt residue.
- Experimental backends belong behind stable contracts.


@@ -0,0 +1,68 @@
# Workplan Planning Map
Date: 2026-05-03
## Purpose
This document captures the current sequencing and priority view for
`markitect-tool` workplans.
State Hub currently supports workstream dependency edges, but it has no native
workstream priority/order fields and does not ingest dependency metadata from
workplan frontmatter. Until both exist, this file and the workplan frontmatter
are the repo source of truth; State Hub dependency edges and descriptions
mirror the operational view.
## Priority Scale
| Priority | Meaning |
| --- | --- |
| `P0` | Current mainline work. |
| `P1` | Next enabling architecture or implementation work. |
| `P2` | High-value follow-on work, start when trigger conditions are met. |
| `P3` | Research-lab or security-sensitive extension work. |
| `complete` | Finished foundation or completed decision work. |
## Current Ordering
| Workplan | Priority | Status | Depends On | Current View |
| --- | --- | --- | --- | --- |
| `MKTT-WP-0001` | complete | done | none | Repository foundation is complete. |
| `MKTT-WP-0002` | complete | done | `MKTT-WP-0001` | Legacy scope extraction is complete. |
| `MKTT-WP-0004` | complete | done | `MKTT-WP-0001`, `MKTT-WP-0002` | Contract framework is complete and informs later validation/generation work. |
| `MKTT-WP-0003` | P0 | active | `MKTT-WP-0001`, `MKTT-WP-0002`, `MKTT-WP-0004` | Mainline implementation. Continue with P3.5 transform/compose/include. |
| `MKTT-WP-0006` | P1 | todo | `MKTT-WP-0004`; task-level trigger: `MKTT-WP-0003-T005` | Start after transform/composition shape is clear and before serious cache work. |
| `MKTT-WP-0007` | P2 | todo | `MKTT-WP-0006` | First practical cache backend use case: AST/JSONPath/SQLite/FTS. |
| `MKTT-WP-0005` | P2 | todo | `MKTT-WP-0003`, `MKTT-WP-0004` | Pick up when generation/form/context or semantic assessment pressure appears. |
| `MKTT-WP-0009` | P2 | todo | `MKTT-WP-0006` | Establish access-control gateway before security-sensitive cache/context use. |
| `MKTT-WP-0008` | P3 | todo | `MKTT-WP-0006`, `MKTT-WP-0007`, `MKTT-WP-0009` | Agent working-memory cache after backend and policy floor are available. |
## Dependency Notes
The most important nuance is `MKTT-WP-0006`: it should not wait for every task
in `MKTT-WP-0003`, because it should shape cache architecture before `P3.7`.
It should wait until `MKTT-WP-0003-T005` gives transform/composition enough
shape to know what cached identities and invalidation rules must preserve.
This is a mixed task/workstream dependency. State Hub does not currently model
that natively.
## State Hub Mirror
Native State Hub dependency edges should mirror the whole-workstream
dependencies:
- `MKTT-WP-0002 -> MKTT-WP-0001`
- `MKTT-WP-0004 -> MKTT-WP-0001`
- `MKTT-WP-0004 -> MKTT-WP-0002`
- `MKTT-WP-0003 -> MKTT-WP-0001`
- `MKTT-WP-0003 -> MKTT-WP-0002`
- `MKTT-WP-0003 -> MKTT-WP-0004`
- `MKTT-WP-0006 -> MKTT-WP-0004`
- `MKTT-WP-0007 -> MKTT-WP-0006`
- `MKTT-WP-0005 -> MKTT-WP-0003`
- `MKTT-WP-0005 -> MKTT-WP-0004`
- `MKTT-WP-0009 -> MKTT-WP-0006`
- `MKTT-WP-0008 -> MKTT-WP-0006`
- `MKTT-WP-0008 -> MKTT-WP-0007`
- `MKTT-WP-0008 -> MKTT-WP-0009`
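The edges above can be sanity-checked for cycles and turned into one valid execution order with `graphlib` from the Python standard library (3.9+). This is a sketch for checking the map, not a State Hub integration; the mixed task-level trigger on `MKTT-WP-0003-T005` is not modelled.

```python
from graphlib import TopologicalSorter

# Each workplan maps to the set of workplans it depends on (whole-workstream edges).
DEPENDS_ON = {
    "MKTT-WP-0001": set(),
    "MKTT-WP-0002": {"MKTT-WP-0001"},
    "MKTT-WP-0004": {"MKTT-WP-0001", "MKTT-WP-0002"},
    "MKTT-WP-0003": {"MKTT-WP-0001", "MKTT-WP-0002", "MKTT-WP-0004"},
    "MKTT-WP-0006": {"MKTT-WP-0004"},
    "MKTT-WP-0007": {"MKTT-WP-0006"},
    "MKTT-WP-0005": {"MKTT-WP-0003", "MKTT-WP-0004"},
    "MKTT-WP-0009": {"MKTT-WP-0006"},
    "MKTT-WP-0008": {"MKTT-WP-0006", "MKTT-WP-0007", "MKTT-WP-0009"},
}

# static_order() raises graphlib.CycleError if the edges contain a cycle.
order = list(TopologicalSorter(DEPENDS_ON).static_order())
```

Any order produced this way lists every workplan after all of its dependencies; priorities from the table would then break ties between workplans that are ready at the same time.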