Workplan dependencies and prio for text research lab workplans

2026-05-04 00:12:07 +02:00
parent 4fc891c076
commit 6f0facd744
18 changed files with 1644 additions and 1 deletions
--- a/docs/cache-backend-architecture-blueprint.md
+++ b/docs/cache-backend-architecture-blueprint.md
@@ -0,0 +1,259 @@
+# Cache Backend Architecture Blueprint
+
+Date: 2026-05-03
+
+## Purpose
+
+This blueprint defines an optional backend architecture for sophisticated
+knowledge systems built on top of `markitect-tool`.
+
+It is a research-lab architecture: powerful enough to support cached ASTs,
+advanced query backends, agent memory, and access control, but separated from
+the slim core so one-off CLI use stays fast and simple.
+
+## Architectural Boundary
+
+The core package owns:
+
+- Markdown parsing
+- document contracts
+- simple selectors
+- deterministic transforms and generation primitives
+- unified diagnostics
+
+The optional backend fabric owns:
+
+- persistent snapshots
+- indexes
+- advanced query adapters
+- memory/context packages
+- policy enforcement
+- provenance records
+- trace and performance metadata
+
+The core must be able to run without the backend fabric.
+
+## Conceptual Layers
+
+```text
+Markdown files
+  -> Core parser and contract layer
+  -> Content-addressed document snapshots
+  -> Index fabric
+      -> AST/JSON index
+      -> full-text index
+      -> vector/semantic index
+      -> analytical/index export
+  -> Query adapter registry
+      -> simple selectors
+      -> JSONPath
+      -> SQL/FTS
+      -> vector/hybrid retrieval
+  -> Context package registry
+      -> activated working sets
+      -> memory namespaces
+      -> agent-ready context bundles
+  -> Access policy gateway
+      -> labels/ACL/ReBAC/ABAC
+      -> result filtering and denial diagnostics
+  -> Provenance and observability
+```
+
+## Core Interfaces
+
+### Snapshot Backend
+
+Responsible for durable parsed-document snapshots.
+
+Minimum protocol:
+
+```text
+put_document(source_path, content, parse_options) -> snapshot_id
+get_snapshot(snapshot_id) -> DocumentSnapshot
+resolve_source(source_path) -> latest snapshot_id
+diff_snapshot(old_id, new_id) -> SnapshotDiff
+```
+
+Snapshot identity should include:
+
+- source content hash
+- parser version
+- parse options
+- contract version when relevant
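One way to read the identity list above is as a hash over a canonical record of everything that can change the parse result. The following is a minimal sketch under that assumption; `make_snapshot_id` and its field names are hypothetical and are not part of the blueprint's protocol.

```python
import hashlib
import json
from typing import Any, Dict, Optional


def make_snapshot_id(
    content: bytes,
    parser_version: str,
    parse_options: Dict[str, Any],
    contract_version: Optional[str] = None,
) -> str:
    """Derive a deterministic id from everything that affects the parse result."""
    identity = {
        "content_sha256": hashlib.sha256(content).hexdigest(),
        "parser_version": parser_version,
        "parse_options": parse_options,
        "contract_version": contract_version,
    }
    # Canonical JSON (sorted keys, fixed separators) so that the order in
    # which options were supplied does not change the id.
    canonical = json.dumps(identity, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

With this shape, re-parsing unchanged content with the same parser and options resolves to the same snapshot, while a parser upgrade or an option change produces a new one.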
+
+### Index Backend
+
+Responsible for derived lookup structures.
+
+Minimum protocol:
+
+```text
+capabilities() -> IndexCapabilities
+build(snapshot_ids, options) -> IndexBuildResult
+refresh(changed_snapshots) -> IndexBuildResult
+query(request) -> QueryResult
+explain(request) -> QueryPlan
+```
+
+Capabilities should include:
+
+- `jsonpath`
+- `sql`
+- `fts`
+- `vector`
+- `hybrid`
+- `inline_tokens`
+- `section_graph`
+- `policy_pushdown`
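Capability declaration could be modeled as a validated set of names that callers check before routing a request. This is a sketch only: the blueprint names `IndexCapabilities` but does not define its shape, so the `names` field, `supports`, and `pick_backend` below are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional

# Capability names taken from the list above.
KNOWN_CAPABILITIES = frozenset({
    "jsonpath", "sql", "fts", "vector", "hybrid",
    "inline_tokens", "section_graph", "policy_pushdown",
})


@dataclass(frozen=True)
class IndexCapabilities:
    names: FrozenSet[str]

    def __post_init__(self) -> None:
        # Reject typos early instead of silently advertising nothing.
        unknown = self.names - KNOWN_CAPABILITIES
        if unknown:
            raise ValueError(f"unknown capabilities: {sorted(unknown)}")

    def supports(self, *required: str) -> bool:
        return set(required) <= self.names


def pick_backend(
    backends: Dict[str, IndexCapabilities], *required: str
) -> Optional[str]:
    """Return the first registered backend that covers every required capability."""
    for name, caps in backends.items():
        if caps.supports(*required):
            return name
    return None
```

A caller that needs full-text search plus SQL would call `pick_backend(backends, "fts", "sql")` and fall back to live-file querying when the result is `None`.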
+
+### Query Adapter
+
+Translates a stable Markitect query request into backend-specific execution.
+
+Minimum protocol:
+
+```text
+name
+supports(selector_or_query, target) -> bool
+execute(document_or_backend, request) -> QueryResult
+explain(request) -> QueryExplanation
+```
+
+Adapters must return a common result envelope:
+
+- kind
+- path
+- value
+- text
+- source location
+- snapshot id
+- provenance
+- policy decision
+- backend metadata
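The common envelope could be expressed as a single dataclass that every adapter fills in, leaving backend-only fields empty for live-file queries. The field names follow the list above; the types and defaults are assumptions, since the blueprint does not specify them.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional


@dataclass
class QueryResult:
    kind: str                    # e.g. "heading", "section", "block"
    path: str                    # selector or path that produced the match
    value: Any                   # structured value of the match
    text: str                    # textual rendering used by extraction
    source_location: Optional[Dict[str, Any]] = None
    snapshot_id: Optional[str] = None      # None when querying live files
    provenance: Dict[str, Any] = field(default_factory=dict)
    policy_decision: Optional[str] = None  # None when no policy was evaluated
    backend_metadata: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]:
        """Serializable form, identical in shape for every adapter."""
        return asdict(self)
```

Keeping optional fields in the envelope, rather than omitting them, means consumers can treat results from simple selectors and from persistent backends the same way.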
+
+### Context Package Registry
+
+Responsible for agent-ready working memory.
+
+Minimum protocol:
+
+```text
+create_package(query_or_manifest, budget, policy) -> context_package_id
+activate(package_id, thread_or_workspace) -> activation_id
+deactivate(activation_id)
+refresh(package_id) -> package_id
+explain(package_id) -> ContextPackageReport
+```
+
+Context packages should include:
+
+- included source spans
+- summary layers
+- token estimates
+- provenance
+- freshness
+- policy labels
+- retrieval recipe
+- cache keys
+
+### Access Policy Gateway
+
+Responsible for authorization and redaction before results leave a backend.
+
+Minimum protocol:
+
+```text
+authorize(subject, action, object, context) -> PolicyDecision
+filter_results(subject, action, results, context) -> FilteredResults
+explain_decision(decision_id) -> PolicyExplanation
+```
+
+Policy should support a ladder:
+
+1. Labels and trust zones.
+2. File/path ACLs.
+3. Relationship-based access control.
+4. Attribute/rule-based policies.
+5. External authorization services.
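The first rung of the ladder, labels and trust zones, can be sketched without any external service. The rule used here (a subject must hold every label attached to an object) is an assumption chosen for illustration, as are the `PolicyDecision` fields and the dictionary shapes for subjects and results.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class PolicyDecision:
    allowed: bool
    reason: str


def authorize(
    subject: Dict[str, Any],
    action: str,
    obj: Dict[str, Any],
    context: Optional[Dict[str, Any]] = None,
) -> PolicyDecision:
    """Label check: the subject must hold every label on the object."""
    missing = set(obj.get("labels", ())) - set(subject.get("labels", ()))
    if missing:
        return PolicyDecision(False, f"{action} denied, missing labels: {sorted(missing)}")
    return PolicyDecision(True, f"{action} allowed")


def filter_results(
    subject: Dict[str, Any],
    action: str,
    results: List[Dict[str, Any]],
    context: Optional[Dict[str, Any]] = None,
) -> Tuple[List[Dict[str, Any]], List[PolicyDecision]]:
    """Split results into those that may leave the backend and denial records."""
    allowed, denials = [], []
    for result in results:
        decision = authorize(subject, action, result, context)
        if decision.allowed:
            allowed.append(result)
        else:
            denials.append(decision)
    return allowed, denials
```

Returning the denials alongside the allowed results keeps filtering visible to the caller, so a query can report that policy removed matches instead of silently returning fewer.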
+
+## Suggested Backend Manifest
+
+Backends should register through a Markdown/YAML manifest:
+
+````markdown
+# Local SQLite Backend
+
+```yaml markitect-backend
+id: local-sqlite-cache
+kind: cache-backend
+capabilities:
+  - snapshots
+  - json
+  - fts
+  - sql
+  - provenance
+storage:
+  engine: sqlite
+  path: .markitect/cache/index.sqlite
+policy:
+  mode: labels
+```
+````
+
+## CLI Direction
+
+The first backend CLI should be explicit:
+
+```text
+mkt cache init
+mkt cache build <path>
+mkt cache status
+mkt cache query <selector-or-query> --backend <name>
+mkt ast show <file>
+mkt ast query <file> <jsonpath>
+mkt context pack <manifest-or-query>
+mkt context activate <package-id>
+mkt policy check <subject> <action> <object>
+```
+
+Do not hide persistence behind `mkt query`. The user should know when the tool
+is querying live files versus a persistent backend.
+
+## Recommended First Stack
+
+Start with:
+
+- content hashes computed with the Python standard library (`hashlib`)
+- SQLite for snapshot metadata, JSON, and FTS5
+- JSONPath as an optional extra
+- local filesystem cache directory
+- simple label policy
+- provenance tables
+
+Defer:
+
+- vector search until text/structure cache works
+- external authorization engines until local policy model is stable
+- MCP server exposure until resources/tools are secure and explainable
+- distributed cache until local invalidation is boring
+
+## Security Notes
+
+Cached data becomes a new data exposure surface.
+
+Minimum requirements before secure use:
+
+- cache location is explicit
+- cache entries know source path and content hash
+- policy mode is visible
+- query results report policy filtering
+- context packages list what they include
+- destructive cache operations require an explicit command
+- no backend silently sends document content to a network service
+
+## Architecture Decision
+
+Implement the backend fabric after deterministic transform/composition
+primitives are underway, but before any serious work on caching, agent memory,
+or advanced query backends. This lets WP-0003 continue while reserving a clean
+path for the research-lab track.
--- a/docs/query-extraction.md
+++ b/docs/query-extraction.md
@@ -0,0 +1,76 @@
+# Query And Extraction
+
+Date: 2026-05-03
+
+## Purpose
+
+The first query layer keeps selection close to the structured Markdown model.
+It is intentionally small and deterministic. JSONPath or another query backend
+can be added later behind the same API if the simple selector language becomes
+too limited.
+
+## CLI
+
+```text
+mkt query <document.md> <selector> [--format json|yaml|text]
+mkt extract <document.md> <selector> [--format text|json|yaml]
+```
+
+`query` returns structured matches. `extract` returns textual content from the
+matches.
+
+## Selectors
+
+Supported targets:
+
+- `document`, `$`, or `.`: full parsed document
+- `frontmatter`: YAML frontmatter
+- `headings`: heading objects
+- `sections`: heading-led sections
+- `blocks`: parsed content blocks
+- `metrics`: document and section metrics
+
+Supported path examples:
+
+```text
+frontmatter.status
+frontmatter.owner.name
+metrics.document.words
+metrics.document.sections
+```
+
+Supported filters:
+
+```text
+headings[level=2]
+headings[text=Decision]
+headings[text~=decision]
+sections[heading=Context]
+sections[heading~=risk]
+sections[contains=problem]
+sections[contains~=PROBLEM]
+blocks[type=paragraph]
+blocks[contains~=follow-up]
+```
+
+`=` is exact and case-sensitive. `~=` is substring matching and
+case-insensitive.
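The difference between the two operators can be made concrete with a small sketch. This is not the tool's parser: `parse_filter` and `matches` are hypothetical helpers that only illustrate the `target[key=value]` and `target[key~=value]` semantics described above, and comparing values via `str()` is an assumption.

```python
import re
from typing import Any, Tuple

# target[key=value] or target[key~=value]
_FILTER_RE = re.compile(r"^(\w+)\[(\w+)(~=|=)(.*)\]$")


def parse_filter(selector: str) -> Tuple[str, str, str, str]:
    """Split a filtered selector into (target, key, operator, expected)."""
    match = _FILTER_RE.match(selector)
    if match is None:
        raise ValueError(f"not a filtered selector: {selector!r}")
    target, key, operator, expected = match.groups()
    return target, key, operator, expected


def matches(actual: Any, operator: str, expected: str) -> bool:
    """`=` is exact and case-sensitive; `~=` is a case-insensitive substring test."""
    text = str(actual)
    if operator == "=":
        return text == expected
    return expected.lower() in text.lower()
```

Under these rules `headings[text=Decision]` matches only a heading whose text is exactly `Decision`, while `headings[text~=decision]` also matches `Architecture Decision`.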
+
+## Current Boundary
+
+This is not a full query language. It covers practical extraction from the
+current parser model:
+
+- frontmatter values
+- headings
+- sections
+- content blocks
+- metrics
+
+Future query backend work should preserve this simple surface and add optional
+adapters rather than forcing every user into a heavier language.
+
+Advanced query and cache backends are tracked in:
+
+- `docs/cache-backend-architecture-blueprint.md`
+- `workplans/MKTT-WP-0007-advanced-query-and-local-index-backend.md`
--- a/docs/research-lab-cache-backend-research.md
+++ b/docs/research-lab-cache-backend-research.md
@@ -0,0 +1,248 @@
+# Research Lab: Sophisticated Cache Backends
+
+Date: 2026-05-03
+
+## Purpose
+
+This research note explores how `markitect-tool` can keep its slim,
+markdown-native core while allowing sophisticated optional backends for cached
+ASTs, structured indexes, multiple query paradigms, agent working memory, and
+access-controlled knowledge systems.
+
+The goal is not to rebuild `markitect-main` wholesale. The goal is to preserve
+the useful insight behind it: once Markdown has been parsed into a trustworthy
+structure, many higher-value operations become possible if that structure can
+be cached, indexed, queried, reactivated, and governed.
+
+## Research Signals
+
+### Content Addressing And Reproducibility
+
+Git's object model is a practical reference for content-addressed storage:
+content is written to an object database and retrieved by a hash-derived key.
+Bazel remote caching similarly separates action outputs from metadata so work
+can be reused when inputs are unchanged.
+
+Relevance:
+
+- Parse results should be keyed by content hash, parser version, and options.
+- Derived indexes should declare their input snapshots and invalidation rules.
+- Reproducible context packages need stable object identities.
+
+Sources:
+
+- https://git-scm.com/book/en/v2/Git-Internals-Git-Objects
+- https://docs.bazel.build/versions/main/remote-caching.html
+
+### Structured Query And AST Introspection
+
+JSONPath is now standardized as RFC 9535. The RFC defines selection and
+extraction over JSON values, and its security considerations cover both
+implementation behavior and how queries are constructed. This makes it a good
+optional backend for power users who need raw access to the full parsed
+structure.
+
+SQLite JSON and FTS5 provide a pragmatic local storage/query foundation. FTS5
+supports full-text search, relevance ranking, phrase/prefix/NEAR queries, and
+external-content tables. These features map well to Markdown sections and
+blocks while keeping local-first operation.
+
+Relevance:
+
+- Keep the current simple selector API as the common surface.
+- Add JSONPath over `Document.to_dict()` as an optional advanced adapter.
+- Add SQLite as the first local persistent index backend.
+- Keep AST introspection as a debugging and research-lab capability, not as
+  the default user interface.
+
+Sources:
+
+- https://www.rfc-editor.org/rfc/rfc9535.html
+- https://www.sqlite.org/json1.html
+- https://www.sqlite.org/fts5.html
+
+### Columnar And Vector Backends
+
+Apache Arrow defines a language-independent columnar memory format. DuckDB is
+strong for local analytical SQL over structured data. Vector databases such as
+Qdrant, LanceDB, and pgvector provide semantic retrieval primitives.
+
+Relevance:
+
+- The core should not depend on any vector database.
+- Index backends should advertise capabilities: text search, SQL, JSONPath,
+  vector search, hybrid retrieval, analytical scans.
+- Vector indexes should store provenance back to document, section, and content
+  hash, not just opaque chunks.
+
+Sources:
+
+- https://arrow.apache.org/docs/format/Columnar.html
+- https://duckdb.org/docs/stable/data/json/overview
+- https://qdrant.tech/documentation/manage-data/collections/
+- https://docs.lancedb.com/
+- https://github.com/pgvector/pgvector
+
+### Agent Context And Working Memory
+
+The Model Context Protocol gives a useful integration model: resources provide
+context/data, tools execute actions, and roots define filesystem or URI
+boundaries. LangChain/LangGraph memory docs distinguish short-term,
+thread-scoped memory from long-term, namespace-scoped memory, and further split
+long-term memory into semantic, episodic, and procedural forms. The MemGPT
+paper frames memory management as an operating-system-like problem for LLMs.
+
+Relevance:
+
+- Markitect context caches should be namespace-scoped and explicitly
+  activatable.
+- A context package should carry text, structure, provenance, policy, freshness,
+  and token-budget metadata.
+- Agents should be able to drop and reactivate working context by stable id.
+- Memory writes need hot-path and background modes.
+
+Sources:
+
+- https://modelcontextprotocol.io/specification/2025-06-18
+- https://docs.langchain.com/oss/python/concepts/memory
+- https://developers.llamaindex.ai/python/framework/module_guides/deploying/agents/memory/
+- https://arxiv.org/abs/2310.08560
+
+### Provenance, Observability, And Debuggability
+
+W3C PROV provides a vocabulary for entities, activities, agents, and
+derivations. OpenTelemetry traces provide spans and attributes for observing
+distributed or multi-step operations.
+
+Relevance:
+
+- Cache entries should record what produced them.
+- Query results should be explainable: source file, section, content hash,
+  index backend, policy decision, and transform chain.
+- Agent context packs should be auditable.
+
+Sources:
+
+- https://www.w3.org/TR/prov-overview/
+- https://opentelemetry.io/docs/concepts/signals/traces/
+
+### Access Control: Fluid To Rigid
+
+Zanzibar demonstrates a relationship-based authorization model at large scale.
+OpenFGA and SpiceDB make Zanzibar-style relationship-based access control
+available as productized systems. OPA/Rego and Cedar provide policy evaluation
+models for attribute and rule-based decisions.
+
+Relevance:
+
+- Markitect should support a fluid-to-rigid access-control ladder.
+- Local labs can start with labels and trust scopes.
+- Secure deployments need policy checks before query results are returned to
+  agents or users.
+- Policy decisions should be part of the diagnostic and provenance trail.
+
+Sources:
+
+- https://www.usenix.org/conference/atc19/presentation/pang
+- https://openfga.dev/docs/concepts
+- https://www.openpolicyagent.org/docs/policy-language
+- https://docs.cedarpolicy.com/
+
+## Main Finding
+
+The optional backend should be a **capability-oriented cache fabric**, not a
+single database choice.
+
+The slim core should continue to parse, validate, query, transform, and
+generate Markdown without persistent infrastructure. The research-lab backend
+should attach through explicit interfaces:
+
+- content-addressed snapshots
+- index manifests
+- query adapter registry
+- memory/context package registry
+- access policy gateway
+- provenance and trace records
+
+That lets the project support spontaneous one-time tool use and also grow into
+high-performance, agentic, security-sensitive knowledge systems.
+
+## Most Promising Use Cases
+
+### UC-RL-001: AST Introspection And JSONPath Backend
+
+Expose raw parsed documents for advanced users:
+
+- `mkt ast show`
+- `mkt ast query --backend jsonpath`
+- raw token and inline query support
+- adapter path from simple selectors to JSONPath where possible
+
+Utility:
+
+- debugging parser behavior
+- developing transforms
+- power-user structural extraction
+- migration path for legacy `markitect-main` AST workflows
+
+### UC-RL-002: Local Persistent Knowledge Index
+
+Build a local cache/index for a repo or document collection:
+
+- content-addressed document snapshots
+- SQLite JSON tables for structure
+- SQLite FTS5 for section/block text search
+- optional DuckDB/Arrow export for analytical work
+- incremental refresh based on content hashes
+
+Utility:
+
+- fast repeated queries
+- search across many Markdown files
+- offline/local-first knowledge work
+- foundation for batch transforms and generation pipelines
+
+### UC-RL-003: Agent Working Memory Cache
+
+Create activatable context packages for LLM agents:
+
+- namespace-scoped memories
+- short-term working sets and long-term caches
+- semantic/episodic/procedural memory categories
+- drop/reactivate by stable id
+- token-budget-aware context assembly
+- provenance and freshness metadata
+
+Utility:
+
+- efficient agent work across long projects
+- reusable context packs for recurring tasks
+- controlled memory updates and recall
+- bridge from Markitect documents to agent infrastructure
+
+### UC-RL-004: Access-Controlled Knowledge Gateway
+
+Add policy enforcement to cached retrieval:
+
+- labels/trust zones for local use
+- ACL/ReBAC/ABAC adapters for stricter systems
+- policy-aware query result filtering
+- decision logs and diagnostics
+- secure context packages for LLM use
+
+Utility:
+
+- enterprise and IT-security use cases
+- multi-tenant knowledge bases
+- agent access control
+- auditable data exposure
+
+## Design Principles
+
+- The core remains infrastructure-free.
+- Backends are optional and capability-declared.
+- Every cached object is content-addressed or provenance-addressed.
+- Query adapters return the same match/result envelope.
+- Policy is checked before data leaves a backend boundary.
+- Context packages are explicit, droppable, and reactivatable.
+- LLM memory is data with provenance, not invisible prompt residue.
+- Experimental backends belong behind stable contracts.
--- a/docs/workplan-planning-map.md
+++ b/docs/workplan-planning-map.md
@@ -0,0 +1,68 @@
+# Workplan Planning Map
+
+Date: 2026-05-03
+
+## Purpose
+
+This document captures the current sequencing and priority view for
+`markitect-tool` workplans.
+
+State Hub currently supports workstream dependency edges, but it does not yet
+have native workstream priority/order fields and does not ingest dependency
+metadata from workplan frontmatter. Until that exists, this file and the
+workplan frontmatter are the repo source of truth; State Hub dependency edges
+and descriptions mirror the operational view.
+
+## Priority Scale
+
+| Priority | Meaning |
+| --- | --- |
+| `P0` | Current mainline work. |
+| `P1` | Next enabling architecture or implementation work. |
+| `P2` | High-value follow-on work; start when trigger conditions are met. |
+| `P3` | Research-lab or security-sensitive extension work. |
+| `complete` | Finished foundation or completed decision work. |
+
+## Current Ordering
+
+| Workplan | Priority | Status | Depends On | Current View |
+| --- | --- | --- | --- | --- |
+| `MKTT-WP-0001` | complete | done | none | Repository foundation is complete. |
+| `MKTT-WP-0002` | complete | done | `MKTT-WP-0001` | Legacy scope extraction is complete. |
+| `MKTT-WP-0004` | complete | done | `MKTT-WP-0001`, `MKTT-WP-0002` | Contract framework is complete and informs later validation/generation work. |
+| `MKTT-WP-0003` | P0 | active | `MKTT-WP-0001`, `MKTT-WP-0002`, `MKTT-WP-0004` | Mainline implementation. Continue with P3.5 transform/compose/include. |
+| `MKTT-WP-0006` | P1 | todo | `MKTT-WP-0004`; task-level trigger: `MKTT-WP-0003-T005` | Start after transform/composition shape is clear and before serious cache work. |
+| `MKTT-WP-0007` | P2 | todo | `MKTT-WP-0006` | First practical cache backend use case: AST/JSONPath/SQLite/FTS. |
+| `MKTT-WP-0005` | P2 | todo | `MKTT-WP-0003`, `MKTT-WP-0004` | Pick up when generation/form/context or semantic assessment pressure appears. |
+| `MKTT-WP-0009` | P2 | todo | `MKTT-WP-0006` | Establish access-control gateway before security-sensitive cache/context use. |
+| `MKTT-WP-0008` | P3 | todo | `MKTT-WP-0006`, `MKTT-WP-0007`, `MKTT-WP-0009` | Agent working-memory cache after backend and policy floor are available. |
+
+## Dependency Notes
+
+The most important nuance is `MKTT-WP-0006`: it should not wait for every task
+in `MKTT-WP-0003`, because it should shape cache architecture before `P3.7`.
+It should wait until `MKTT-WP-0003-T005` gives transform/composition enough
+shape to know what cached identities and invalidation rules must preserve.
+
+This is a mixed task/workstream dependency. State Hub does not currently model
+that natively.
+
+## State Hub Mirror
+
+Native State Hub dependency edges should mirror the whole-workstream
+dependencies:
+
+- `MKTT-WP-0002 -> MKTT-WP-0001`
+- `MKTT-WP-0004 -> MKTT-WP-0001`
+- `MKTT-WP-0004 -> MKTT-WP-0002`
+- `MKTT-WP-0003 -> MKTT-WP-0001`
+- `MKTT-WP-0003 -> MKTT-WP-0002`
+- `MKTT-WP-0003 -> MKTT-WP-0004`
+- `MKTT-WP-0006 -> MKTT-WP-0004`
+- `MKTT-WP-0007 -> MKTT-WP-0006`
+- `MKTT-WP-0005 -> MKTT-WP-0003`
+- `MKTT-WP-0005 -> MKTT-WP-0004`
+- `MKTT-WP-0009 -> MKTT-WP-0006`
+- `MKTT-WP-0008 -> MKTT-WP-0006`
+- `MKTT-WP-0008 -> MKTT-WP-0007`
+- `MKTT-WP-0008 -> MKTT-WP-0009`
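The edge list above can be checked mechanically for cycles and turned into a valid execution order. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (3.9+) and reads each `A -> B` edge as "A depends on B", matching the table; the helper name `build_graph` is an assumption, and nothing here reflects how State Hub itself stores edges.

```python
from graphlib import TopologicalSorter
from typing import Dict, Iterable, Set

EDGES = [
    "MKTT-WP-0002 -> MKTT-WP-0001",
    "MKTT-WP-0004 -> MKTT-WP-0001",
    "MKTT-WP-0004 -> MKTT-WP-0002",
    "MKTT-WP-0003 -> MKTT-WP-0001",
    "MKTT-WP-0003 -> MKTT-WP-0002",
    "MKTT-WP-0003 -> MKTT-WP-0004",
    "MKTT-WP-0006 -> MKTT-WP-0004",
    "MKTT-WP-0007 -> MKTT-WP-0006",
    "MKTT-WP-0005 -> MKTT-WP-0003",
    "MKTT-WP-0005 -> MKTT-WP-0004",
    "MKTT-WP-0009 -> MKTT-WP-0006",
    "MKTT-WP-0008 -> MKTT-WP-0006",
    "MKTT-WP-0008 -> MKTT-WP-0007",
    "MKTT-WP-0008 -> MKTT-WP-0009",
]


def build_graph(edges: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each workplan to the set of workplans it depends on."""
    graph: Dict[str, Set[str]] = {}
    for edge in edges:
        dependent, dependency = (part.strip() for part in edge.split("->"))
        graph.setdefault(dependent, set()).add(dependency)
        graph.setdefault(dependency, set())
    return graph


# static_order() yields dependencies before dependents and raises
# graphlib.CycleError if the edges contain a cycle.
order = list(TopologicalSorter(build_graph(EDGES)).static_order())
print(order)
```

Several orders are valid for this graph, so the result should be treated as one consistent sequencing rather than a priority ranking; the task-level trigger on `MKTT-WP-0003-T005` is not represented in these whole-workstream edges.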