Workplan dependencies and prio for text research lab workplans

2026-05-04 00:12:07 +02:00
parent 4fc891c076
commit 6f0facd744
18 changed files with 1644 additions and 1 deletions


@@ -0,0 +1,259 @@
# Cache Backend Architecture Blueprint
Date: 2026-05-03
## Purpose
This blueprint defines an optional backend architecture for sophisticated
knowledge systems built on top of `markitect-tool`.
It is a research-lab architecture: powerful enough to support cached ASTs,
advanced query backends, agent memory, and access control, but separated from
the slim core so one-off CLI use stays fast and simple.
## Architectural Boundary
The core package owns:
- Markdown parsing
- document contracts
- simple selectors
- deterministic transforms and generation primitives
- unified diagnostics
The optional backend fabric owns:
- persistent snapshots
- indexes
- advanced query adapters
- memory/context packages
- policy enforcement
- provenance records
- trace and performance metadata
The core must be able to run without the backend fabric.
## Conceptual Layers
```text
Markdown files
  -> Core parser and contract layer
  -> Content-addressed document snapshots
  -> Index fabric
       -> AST/JSON index
       -> full-text index
       -> vector/semantic index
       -> analytical/index export
  -> Query adapter registry
       -> simple selectors
       -> JSONPath
       -> SQL/FTS
       -> vector/hybrid retrieval
  -> Context package registry
       -> activated working sets
       -> memory namespaces
       -> agent-ready context bundles
  -> Access policy gateway
       -> labels/ACL/ReBAC/ABAC
       -> result filtering and denial diagnostics
  -> Provenance and observability
```
## Core Interfaces
### Snapshot Backend
Responsible for durable parsed-document snapshots.
Minimum protocol:
```text
put_document(source_path, content, parse_options) -> snapshot_id
get_snapshot(snapshot_id) -> DocumentSnapshot
resolve_source(source_path) -> latest snapshot_id
diff_snapshot(old_id, new_id) -> SnapshotDiff
```
Snapshot identity should include:
- source content hash
- parser version
- parse options
- contract version when relevant
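
The identity inputs listed above can be folded into one stable key. The following is a minimal sketch, not the project's implementation: the function name `compute_snapshot_id` and the choice of SHA-256 over canonical JSON are assumptions made for illustration.

```python
from __future__ import annotations

import hashlib
import json
from typing import Any


def compute_snapshot_id(
    content: str,
    parser_version: str,
    parse_options: dict[str, Any],
    contract_version: str | None = None,
) -> str:
    """Derive a snapshot id from every input that can change the parse result.

    Hypothetical helper: the blueprint only lists the inputs, not the encoding.
    """
    payload = {
        "content_sha256": hashlib.sha256(content.encode("utf-8")).hexdigest(),
        "parser_version": parser_version,
        "parse_options": parse_options,
        "contract_version": contract_version,
    }
    # Canonical JSON (sorted keys, fixed separators) keeps the id independent
    # of dictionary insertion order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Under this sketch, changing the parser version or any parse option yields a new id even when the source text is unchanged, which is what lets stale snapshots be detected without comparing parsed output.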
### Index Backend
Responsible for derived lookup structures.
Minimum protocol:
```text
capabilities() -> IndexCapabilities
build(snapshot_ids, options) -> IndexBuildResult
refresh(changed_snapshots) -> IndexBuildResult
query(request) -> QueryResult
explain(request) -> QueryPlan
```
Capabilities should include:
- `jsonpath`
- `sql`
- `fts`
- `vector`
- `hybrid`
- `inline_tokens`
- `section_graph`
- `policy_pushdown`
### Query Adapter
Translates a stable Markitect query request into backend-specific execution.
Minimum protocol:
```text
name
supports(selector_or_query, target) -> bool
execute(document_or_backend, request) -> QueryResult
explain(request) -> QueryExplanation
```
Adapters must return a common result envelope:
- kind
- path
- value
- text
- source location
- snapshot id
- provenance
- policy decision
- backend metadata
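
One way to picture the common envelope is as a frozen dataclass. This is a sketch under assumptions: the class name `ResultEnvelope` and the field types are hypothetical, and only the field list comes from the blueprint.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass
from typing import Any


@dataclass(frozen=True)
class ResultEnvelope:
    """Hypothetical shape for the result every adapter returns."""

    kind: str
    path: str
    value: Any
    text: str | None = None
    source_location: dict[str, Any] | None = None
    snapshot_id: str | None = None
    provenance: dict[str, Any] | None = None
    policy_decision: str | None = None
    backend_metadata: dict[str, Any] | None = None

    def to_dict(self) -> dict[str, Any]:
        # Omit unset fields so a live-file query and a backend query can
        # share one envelope without emitting empty keys.
        return {key: value for key, value in asdict(self).items() if value is not None}
```

Keeping the backend-only fields optional means the simple selector path can fill in `kind`, `path`, `value`, and `text` and leave the rest unset.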
### Context Package Registry
Responsible for agent-ready working memory.
Minimum protocol:
```text
create_package(query_or_manifest, budget, policy) -> context_package_id
activate(package_id, thread_or_workspace) -> activation_id
deactivate(activation_id)
refresh(package_id) -> package_id
explain(package_id) -> ContextPackageReport
```
Context packages should include:
- included source spans
- summary layers
- token estimates
- provenance
- freshness
- policy labels
- retrieval recipe
- cache keys
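
A budget-aware `create_package` could look roughly like the sketch below. Everything here is an assumption for illustration: the names `SourceSpan` and `ContextPackage`, the whitespace-based token estimate, and the greedy first-fit strategy are not specified by the blueprint.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class SourceSpan:
    """Hypothetical span record carrying provenance alongside its text."""

    source_path: str
    content_hash: str
    text: str

    @property
    def token_estimate(self) -> int:
        # Crude estimate: one token per whitespace-separated word.
        return len(self.text.split())


@dataclass
class ContextPackage:
    budget: int
    spans: list[SourceSpan] = field(default_factory=list)
    skipped: list[SourceSpan] = field(default_factory=list)

    @property
    def token_estimate(self) -> int:
        return sum(span.token_estimate for span in self.spans)


def create_package(candidates: list[SourceSpan], budget: int) -> ContextPackage:
    """Greedily include spans in order while the token budget allows."""
    package = ContextPackage(budget=budget)
    for span in candidates:
        if package.token_estimate + span.token_estimate <= budget:
            package.spans.append(span)
        else:
            # Record what was left out so the package can explain itself.
            package.skipped.append(span)
    return package
```

Tracking skipped spans is a deliberate choice in this sketch: it gives `explain(package_id)` something concrete to report when a budget forces omissions.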
### Access Policy Gateway
Responsible for authorization and redaction before results leave a backend.
Minimum protocol:
```text
authorize(subject, action, object, context) -> PolicyDecision
filter_results(subject, action, results, context) -> FilteredResults
explain_decision(decision_id) -> PolicyExplanation
```
Policy should support a ladder:
1. Labels and trust zones.
2. File/path ACLs.
3. Relationship-based access control.
4. Attribute/rule-based policies.
5. External authorization services.
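
The first rung of the ladder, labels, is small enough to sketch directly. The model below is an assumption, not the project's policy engine: a subject may read an object only if it is cleared for every label on that object, and the result dictionaries with a `labels` key are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class PolicyDecision:
    allowed: bool
    reason: str


def authorize(subject_labels: frozenset[str], action: str, object_labels: frozenset[str]) -> PolicyDecision:
    """Label-mode check: only reads, and only when all object labels are cleared."""
    if action != "read":
        return PolicyDecision(False, f"action '{action}' is not permitted in label mode")
    missing = object_labels - subject_labels
    if missing:
        return PolicyDecision(False, "missing labels: " + ", ".join(sorted(missing)))
    return PolicyDecision(True, "all object labels are cleared")


def filter_results(
    subject_labels: frozenset[str],
    action: str,
    results: list[dict[str, Any]],
) -> dict[str, Any]:
    """Drop denied results and report how many were filtered."""
    allowed: list[dict[str, Any]] = []
    denied = 0
    for result in results:
        decision = authorize(subject_labels, action, frozenset(result.get("labels", ())))
        if decision.allowed:
            allowed.append(result)
        else:
            denied += 1
    return {"results": allowed, "denied_count": denied}
```

Returning `denied_count` next to the surviving results keeps filtering visible to the caller instead of silently shrinking the result set.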
## Suggested Backend Manifest
Backends should register through a Markdown/YAML manifest:
````markdown
# Local SQLite Backend
```yaml markitect-backend
id: local-sqlite-cache
kind: cache-backend
capabilities:
  - snapshots
  - json
  - fts
  - sql
  - provenance
storage:
  engine: sqlite
  path: .markitect/cache/index.sqlite
policy:
  mode: labels
```
````
## CLI Direction
The first backend CLI should be explicit:
```text
mkt cache init
mkt cache build <path>
mkt cache status
mkt cache query <selector-or-query> --backend <name>
mkt ast show <file>
mkt ast query <file> <jsonpath>
mkt context pack <manifest-or-query>
mkt context activate <package-id>
mkt policy check <subject> <action> <object>
```
Do not hide persistence behind `mkt query`. The user should know when the tool
is querying live files versus a persistent backend.
## Recommended First Stack
Start with:
- content hashes in Python standard library
- SQLite for snapshot metadata, JSON, and FTS5
- JSONPath as an optional extra
- local filesystem cache directory
- simple label policy
- provenance tables
Defer:
- vector search until text/structure cache works
- external authorization engines until local policy model is stable
- MCP server exposure until resources/tools are secure and explainable
- distributed cache until local invalidation is boring
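
The "Start with" items can be combined using only the Python standard library. The sketch below is an assumption-laden illustration rather than the planned schema: the table names `snapshots` and `section_fts`, their columns, and the helper names are hypothetical, and it requires an SQLite build with FTS5 enabled.

```python
from __future__ import annotations

import hashlib
import sqlite3

# Hypothetical schema: one metadata table plus one FTS5 table over sections.
SCHEMA = """
CREATE TABLE IF NOT EXISTS snapshots (
    content_hash TEXT PRIMARY KEY,
    source_path  TEXT NOT NULL
);
CREATE VIRTUAL TABLE IF NOT EXISTS section_fts USING fts5(
    content_hash UNINDEXED,
    source_path UNINDEXED,
    heading,
    body
);
"""


def build_cache(
    conn: sqlite3.Connection,
    source_path: str,
    content: str,
    sections: list[tuple[str, str]],
) -> str:
    """Store one document's sections keyed by the hash of its source text."""
    content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT OR REPLACE INTO snapshots (content_hash, source_path) VALUES (?, ?)",
        (content_hash, source_path),
    )
    # Rebuilding the same snapshot must not duplicate its FTS rows.
    conn.execute("DELETE FROM section_fts WHERE content_hash = ?", (content_hash,))
    conn.executemany(
        "INSERT INTO section_fts (content_hash, source_path, heading, body) VALUES (?, ?, ?, ?)",
        [(content_hash, source_path, heading, body) for heading, body in sections],
    )
    conn.commit()
    return content_hash


def search(conn: sqlite3.Connection, query: str) -> list[tuple[str, str]]:
    """Return (source_path, heading) pairs, best match first."""
    rows = conn.execute(
        "SELECT source_path, heading FROM section_fts WHERE section_fts MATCH ? ORDER BY rank",
        (query,),
    )
    return rows.fetchall()
```

Every FTS row carries the source path and content hash, so a search hit can always be traced back to the exact snapshot it came from.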
## Security Notes
Cached data becomes a new data exposure surface.
Minimum requirements before secure use:
- cache location is explicit
- cache entries know source path and content hash
- policy mode is visible
- query results report policy filtering
- context packages list what they include
- destructive cache operations require explicit command
- no backend silently sends document content to a network service
## Architecture Decision
Implement the backend fabric after deterministic transform/composition
primitives are underway, but before serious caching, agent memory, or advanced
query backends. This lets WP-0003 continue while reserving a clean path for the
research-lab track.

docs/query-extraction.md Normal file

@@ -0,0 +1,76 @@
# Query And Extraction
Date: 2026-05-03
## Purpose
The first query layer keeps selection close to the structured Markdown model.
It is intentionally small and deterministic. JSONPath or another query backend
can be added later behind the same API if the simple selector language becomes
too limited.
## CLI
```text
mkt query <document.md> <selector> [--format json|yaml|text]
mkt extract <document.md> <selector> [--format text|json|yaml]
```
`query` returns structured matches. `extract` returns textual content from the
matches.
## Selectors
Supported targets:
- `document`, `$`, or `.`: full parsed document
- `frontmatter`: YAML frontmatter
- `headings`: heading objects
- `sections`: heading-led sections
- `blocks`: parsed content blocks
- `metrics`: document and section metrics
Supported path examples:
```text
frontmatter.status
frontmatter.owner.name
metrics.document.words
metrics.document.sections
```
Supported filters:
```text
headings[level=2]
headings[text=Decision]
headings[text~=decision]
sections[heading=Context]
sections[heading~=risk]
sections[contains=problem]
sections[contains~=PROBLEM]
blocks[type=paragraph]
blocks[contains~=follow-up]
```
`=` is case-sensitive: it is an exact match for `level`, `text`, `heading`, and
`type`, and a substring test for `contains`. `~=` is case-insensitive substring
matching.
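
The comparison rules can be summarised in a few lines. This is a simplified sketch that mirrors how the engine in this commit compares values; the function name `filter_matches` is hypothetical and does not exist in `markitect_tool`.

```python
def filter_matches(key: str, operator: str, expected: str, actual: str) -> bool:
    """Apply one selector filter to one string value.

    Hypothetical helper for illustration only.
    """
    if operator == "~=":
        # Case-insensitive substring test for every key.
        return expected.lower() in actual.lower()
    if operator == "=":
        if key == "contains":
            # `contains` stays a substring test, but case-sensitive.
            return expected in actual
        # All other keys require the whole value to match exactly.
        return actual == expected
    raise ValueError(f"Unsupported operator {operator!r}")
```

For instance, `sections[heading=decision]` would not match a heading written as `Decision`, while `sections[heading~=decision]` would.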
## Current Boundary
This is not a full query language. It covers practical extraction from the
current parser model:
- frontmatter values
- headings
- sections
- content blocks
- metrics
Future query backend work should preserve this simple surface and add optional
adapters rather than forcing every user into a heavier language.
Advanced query and cache backends are tracked in:
- `docs/cache-backend-architecture-blueprint.md`
- `workplans/MKTT-WP-0007-advanced-query-and-local-index-backend.md`


@@ -0,0 +1,248 @@
# Research Lab: Sophisticated Cache Backends
Date: 2026-05-03
## Purpose
This research note explores how `markitect-tool` can keep its slim,
markdown-native core while allowing sophisticated optional backends for cached
ASTs, structured indexes, multiple query paradigms, agent working memory, and
access-controlled knowledge systems.
The goal is not to rebuild `markitect-main` wholesale. The goal is to preserve
the useful insight behind it: once Markdown has been parsed into a trustworthy
structure, many higher-value operations become possible if that structure can
be cached, indexed, queried, reactivated, and governed.
## Research Signals
### Content Addressing And Reproducibility
Git's object model is a practical reference for content-addressed storage:
content is written to an object database and retrieved by a hash-derived key.
Bazel remote caching similarly separates action outputs from metadata so work
can be reused when inputs are unchanged.
Relevance:
- Parse results should be keyed by content hash, parser version, and options.
- Derived indexes should declare their input snapshots and invalidation rules.
- Reproducible context packages need stable object identities.
Sources:
- https://git-scm.com/book/en/v2/Git-Internals-Git-Objects
- https://docs.bazel.build/versions/main/remote-caching.html
### Structured Query And AST Introspection
JSONPath is now standardized as RFC 9535. It defines selection and extraction
over JSON values and has security considerations around implementation behavior
and query construction. This makes it a good optional backend for power users
who need raw access to the full parsed structure.
SQLite JSON and FTS5 provide a pragmatic local storage/query foundation. FTS5
supports full-text search, relevance ranking, phrase/prefix/NEAR queries, and
external-content tables. These features map well to Markdown sections and
blocks while keeping local-first operation.
Relevance:
- Keep the current simple selector API as the common surface.
- Add JSONPath over `Document.to_dict()` as an optional advanced adapter.
- Add SQLite as the first local persistent index backend.
- Keep AST introspection as a debugging and research-lab capability, not as
the default user interface.
Sources:
- https://www.rfc-editor.org/rfc/rfc9535.html
- https://www.sqlite.org/json1.html
- https://www.sqlite.org/fts5.html
### Columnar And Vector Backends
Apache Arrow defines a language-independent columnar memory format. DuckDB is
strong for local analytical SQL over structured data. Vector databases such as
Qdrant, LanceDB, and pgvector provide semantic retrieval primitives.
Relevance:
- The core should not depend on any vector database.
- Index backends should advertise capabilities: text search, SQL, JSONPath,
vector search, hybrid retrieval, analytical scans.
- Vector indexes should store provenance back to document, section, and content
hash, not just opaque chunks.
Sources:
- https://arrow.apache.org/docs/format/Columnar.html
- https://duckdb.org/docs/stable/data/json/overview
- https://qdrant.tech/documentation/manage-data/collections/
- https://docs.lancedb.com/
- https://github.com/pgvector/pgvector
### Agent Context And Working Memory
The Model Context Protocol gives a useful integration model: resources provide
context/data, tools execute actions, and roots define filesystem or URI
boundaries. LangChain/LangGraph memory docs distinguish short-term,
thread-scoped memory from long-term, namespace-scoped memory, and further split
long-term memory into semantic, episodic, and procedural forms. The MemGPT
paper frames memory management as an operating-system-like problem for LLMs.
Relevance:
- Markitect context caches should be namespace-scoped and explicitly
activatable.
- A context package should carry text, structure, provenance, policy, freshness,
and token-budget metadata.
- Agents should be able to drop and reactivate working context by stable id.
- Memory writes need hot-path and background modes.
Sources:
- https://modelcontextprotocol.io/specification/2025-06-18
- https://docs.langchain.com/oss/python/concepts/memory
- https://developers.llamaindex.ai/python/framework/module_guides/deploying/agents/memory/
- https://arxiv.org/abs/2310.08560
### Provenance, Observability, And Debuggability
W3C PROV provides a vocabulary for entities, activities, agents, and
derivations. OpenTelemetry traces provide spans and attributes for observing
distributed or multi-step operations.
Relevance:
- Cache entries should record what produced them.
- Query results should be explainable: source file, section, content hash,
index backend, policy decision, and transform chain.
- Agent context packs should be auditable.
Sources:
- https://www.w3.org/TR/prov-overview/
- https://opentelemetry.io/docs/concepts/signals/traces/
### Access Control: Fluid To Rigid
Zanzibar demonstrates a relationship-based authorization model at large scale.
OpenFGA and SpiceDB make Zanzibar-style relationship-based access control
available as productized systems. OPA/Rego and Cedar provide policy evaluation
models for attribute and rule-based decisions.
Relevance:
- Markitect should support a fluid-to-rigid access-control ladder.
- Local labs can start with labels and trust scopes.
- Secure deployments need policy checks before query results are returned to
agents or users.
- Policy decisions should be part of the diagnostic and provenance trail.
Sources:
- https://www.usenix.org/conference/atc19/presentation/pang
- https://openfga.dev/docs/concepts
- https://www.openpolicyagent.org/docs/policy-language
- https://docs.cedarpolicy.com/
## Main Finding
The optional backend should be a **capability-oriented cache fabric**, not a
single database choice.
The slim core should continue to parse, validate, query, transform, and
generate Markdown without persistent infrastructure. The research-lab backend
should attach through explicit interfaces:
- content-addressed snapshots
- index manifests
- query adapter registry
- memory/context package registry
- access policy gateway
- provenance and trace records
That lets the project support spontaneous one-time tool use and also grow into
high-performance, agentic, security-sensitive knowledge systems.
## Most Promising Use Cases
### UC-RL-001: AST Introspection And JSONPath Backend
Expose raw parsed documents for advanced users:
- `mkt ast show`
- `mkt ast query --backend jsonpath`
- raw token and inline query support
- adapter path from simple selectors to JSONPath where possible
Utility:
- debugging parser behavior
- developing transforms
- power-user structural extraction
- migration path for legacy `markitect-main` AST workflows
### UC-RL-002: Local Persistent Knowledge Index
Build a local cache/index for a repo or document collection:
- content-addressed document snapshots
- SQLite JSON tables for structure
- SQLite FTS5 for section/block text search
- optional DuckDB/Arrow export for analytical work
- incremental refresh based on content hashes
Utility:
- fast repeated queries
- search across many Markdown files
- offline/local-first knowledge work
- foundation for batch transforms and generation pipelines
### UC-RL-003: Agent Working Memory Cache
Create activatable context packages for LLM agents:
- namespace-scoped memories
- short-term working sets and long-term caches
- semantic/episodic/procedural memory categories
- drop/reactivate by stable id
- token-budget-aware context assembly
- provenance and freshness metadata
Utility:
- efficient agent work across long projects
- reusable context packs for recurring tasks
- controlled memory updates and recall
- bridge from Markitect documents to agent infrastructure
### UC-RL-004: Access-Controlled Knowledge Gateway
Add policy enforcement to cached retrieval:
- labels/trust zones for local use
- ACL/ReBAC/ABAC adapters for stricter systems
- policy-aware query result filtering
- decision logs and diagnostics
- secure context packages for LLM use
Utility:
- enterprise and IT-security use cases
- multi-tenant knowledge bases
- agent access control
- auditable data exposure
## Design Principles
- The core remains infrastructure-free.
- Backends are optional and capability-declared.
- Every cached object is content-addressed or provenance-addressed.
- Query adapters return the same match/result envelope.
- Policy is checked before data leaves a backend boundary.
- Context packages are explicit, droppable, and reactivatable.
- LLM memory is data with provenance, not invisible prompt residue.
- Experimental backends belong behind stable contracts.


@@ -0,0 +1,68 @@
# Workplan Planning Map
Date: 2026-05-03
## Purpose
This document captures the current sequencing and priority view for
`markitect-tool` workplans.
State Hub currently supports workstream dependency edges, but it does not yet
have native workstream priority/order fields and does not ingest dependency
metadata from workplan frontmatter. Until that exists, this file and the
workplan frontmatter are the repo source of truth; State Hub dependency edges
and descriptions mirror the operational view.
## Priority Scale
| Priority | Meaning |
| --- | --- |
| `P0` | Current mainline work. |
| `P1` | Next enabling architecture or implementation work. |
| `P2` | High-value follow-on work, start when trigger conditions are met. |
| `P3` | Research-lab or security-sensitive extension work. |
| `complete` | Finished foundation or completed decision work. |
## Current Ordering
| Workplan | Priority | Status | Depends On | Current View |
| --- | --- | --- | --- | --- |
| `MKTT-WP-0001` | complete | done | none | Repository foundation is complete. |
| `MKTT-WP-0002` | complete | done | `MKTT-WP-0001` | Legacy scope extraction is complete. |
| `MKTT-WP-0004` | complete | done | `MKTT-WP-0001`, `MKTT-WP-0002` | Contract framework is complete and informs later validation/generation work. |
| `MKTT-WP-0003` | P0 | active | `MKTT-WP-0001`, `MKTT-WP-0002`, `MKTT-WP-0004` | Mainline implementation. Continue with P3.5 transform/compose/include. |
| `MKTT-WP-0006` | P1 | todo | `MKTT-WP-0004`; task-level trigger: `MKTT-WP-0003-T005` | Start after transform/composition shape is clear and before serious cache work. |
| `MKTT-WP-0007` | P2 | todo | `MKTT-WP-0006` | First practical cache backend use case: AST/JSONPath/SQLite/FTS. |
| `MKTT-WP-0005` | P2 | todo | `MKTT-WP-0003`, `MKTT-WP-0004` | Pick up when generation/form/context or semantic assessment pressure appears. |
| `MKTT-WP-0009` | P2 | todo | `MKTT-WP-0006` | Establish access-control gateway before security-sensitive cache/context use. |
| `MKTT-WP-0008` | P3 | todo | `MKTT-WP-0006`, `MKTT-WP-0007`, `MKTT-WP-0009` | Agent working-memory cache after backend and policy floor are available. |
## Dependency Notes
The most important nuance is `MKTT-WP-0006`: it should not wait for every task
in `MKTT-WP-0003`, because it should shape cache architecture before `P3.7`.
It should wait until `MKTT-WP-0003-T005` gives transform/composition enough
shape to know what cached identities and invalidation rules must preserve.
This is a mixed task/workstream dependency. State Hub does not currently model
that natively.
## State Hub Mirror
Native State Hub dependency edges should mirror the whole-workstream
dependencies:
- `MKTT-WP-0002 -> MKTT-WP-0001`
- `MKTT-WP-0004 -> MKTT-WP-0001`
- `MKTT-WP-0004 -> MKTT-WP-0002`
- `MKTT-WP-0003 -> MKTT-WP-0001`
- `MKTT-WP-0003 -> MKTT-WP-0002`
- `MKTT-WP-0003 -> MKTT-WP-0004`
- `MKTT-WP-0006 -> MKTT-WP-0004`
- `MKTT-WP-0007 -> MKTT-WP-0006`
- `MKTT-WP-0005 -> MKTT-WP-0003`
- `MKTT-WP-0005 -> MKTT-WP-0004`
- `MKTT-WP-0009 -> MKTT-WP-0006`
- `MKTT-WP-0008 -> MKTT-WP-0006`
- `MKTT-WP-0008 -> MKTT-WP-0007`
- `MKTT-WP-0008 -> MKTT-WP-0009`
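
The edge list above is enough to compute a valid start order mechanically. The following sketch uses the standard-library `graphlib` module; the `DEPENDS_ON` mapping restates the edges from this document, while the helper name `start_order` is hypothetical and State Hub is not involved.

```python
from graphlib import TopologicalSorter


def wp(number: int) -> str:
    """Format a workplan id such as MKTT-WP-0006."""
    return f"MKTT-WP-{number:04d}"


# Workplan -> workplans it depends on, restated from the edge list.
DEPENDS_ON = {
    wp(1): set(),
    wp(2): {wp(1)},
    wp(4): {wp(1), wp(2)},
    wp(3): {wp(1), wp(2), wp(4)},
    wp(6): {wp(4)},
    wp(7): {wp(6)},
    wp(5): {wp(3), wp(4)},
    wp(9): {wp(6)},
    wp(8): {wp(6), wp(7), wp(9)},
}


def start_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return one order in which every workplan follows its dependencies.

    Raises graphlib.CycleError if the edges contain a cycle.
    """
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for workplan in start_order(DEPENDS_ON):
        print(workplan)
```

A topological order only encodes the whole-workstream edges; the task-level trigger on `MKTT-WP-0003-T005` and the priority column still have to be applied by hand.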


@@ -21,6 +21,12 @@ from markitect_tool.contract import (
    validate_contract_file,
)
from markitect_tool.diagnostics import Diagnostic, SourceLocation
from markitect_tool.query import (
    InvalidQueryError,
    QueryMatch,
    extract_document,
    query_document,
)
from markitect_tool.schema import (
    MarkdownSchema,
    SchemaValidationResult,
@@ -55,4 +61,8 @@ __all__ = [
    "validate_contract_file",
    "Diagnostic",
    "SourceLocation",
    "InvalidQueryError",
    "QueryMatch",
    "extract_document",
    "query_document",
]


@@ -16,6 +16,7 @@ from markitect_tool.contract import (
    load_contract_file,
    validate_contract,
)
from markitect_tool.query import InvalidQueryError, extract_document, query_document
from markitect_tool.schema import load_schema_file, validate_markdown_file, validate_schema
@@ -65,6 +66,60 @@ def metrics(file: Path, output_format: str) -> None:
    _emit_metrics(data, output_format)


@main.command()
@click.argument("file", type=click.Path(exists=True, dir_okay=False, path_type=Path))
@click.argument("selector")
@click.option(
    "--format",
    "output_format",
    type=click.Choice(["json", "yaml", "text"], case_sensitive=False),
    default="json",
    show_default=True,
)
def query(file: Path, selector: str, output_format: str) -> None:
    """Query structured Markdown content with a small selector."""
    document = parse_markdown_file(file)
    try:
        matches = query_document(document, selector)
    except InvalidQueryError as exc:
        raise click.ClickException(str(exc)) from exc
    data = {
        "selector": selector,
        "document_path": str(file),
        "count": len(matches),
        "matches": [match.to_dict() for match in matches],
    }
    _emit_query(data, output_format)


@main.command()
@click.argument("file", type=click.Path(exists=True, dir_okay=False, path_type=Path))
@click.argument("selector")
@click.option(
    "--format",
    "output_format",
    type=click.Choice(["text", "json", "yaml"], case_sensitive=False),
    default="text",
    show_default=True,
)
def extract(file: Path, selector: str, output_format: str) -> None:
    """Extract text or Markdown content from structured Markdown."""
    document = parse_markdown_file(file)
    try:
        items = extract_document(document, selector)
    except InvalidQueryError as exc:
        raise click.ClickException(str(exc)) from exc
    data = {
        "selector": selector,
        "document_path": str(file),
        "count": len(items),
        "items": items,
    }
    _emit_extract(data, output_format)


@main.command()
@click.argument("file", type=click.Path(exists=True, dir_okay=False, path_type=Path))
@click.option(
@@ -214,5 +269,28 @@ def _emit_metrics(data: dict, output_format: str) -> None:
)
def _emit_query(data: dict, output_format: str) -> None:
    if output_format == "json":
        click.echo(json.dumps(data, indent=2, ensure_ascii=False))
    elif output_format == "yaml":
        click.echo(yaml.safe_dump(data, sort_keys=False))
    else:
        click.echo(f"{data['count']} match(es)")
        for match in data["matches"]:
            location = f":{match['line']}" if match.get("line") else ""
            click.echo(f"- {match['kind']} {match['path']}{location}")
            if match.get("text"):
                click.echo(f"  {match['text'].splitlines()[0]}")


def _emit_extract(data: dict, output_format: str) -> None:
    if output_format == "json":
        click.echo(json.dumps(data, indent=2, ensure_ascii=False))
    elif output_format == "yaml":
        click.echo(yaml.safe_dump(data, sort_keys=False))
    else:
        click.echo("\n\n".join(data["items"]))


if __name__ == "__main__":
    main()


@@ -0,0 +1,15 @@
"""Query and extraction helpers for parsed Markdown documents."""

from markitect_tool.query.engine import (
    InvalidQueryError,
    QueryMatch,
    extract_document,
    query_document,
)

__all__ = [
    "InvalidQueryError",
    "QueryMatch",
    "extract_document",
    "query_document",
]


@@ -0,0 +1,242 @@
"""Small selector engine for structured Markdown documents."""

from __future__ import annotations

from dataclasses import dataclass
from typing import Any

from markitect_tool.contract import collect_metrics
from markitect_tool.core import ContentBlock, Document, Heading, Section


class InvalidQueryError(ValueError):
    """Raised when a selector cannot be parsed or evaluated."""


@dataclass(frozen=True)
class QueryMatch:
    """One match returned by a selector."""

    kind: str
    path: str
    value: Any
    text: str | None = None
    line: int | None = None

    def to_dict(self) -> dict[str, Any]:
        data = {
            "kind": self.kind,
            "path": self.path,
            "value": self.value,
            "text": self.text,
            "line": self.line,
        }
        return {key: value for key, value in data.items() if value is not None}


@dataclass(frozen=True)
class _Selector:
    target: str
    path: list[str]
    filters: dict[str, str]


def query_document(document: Document, selector: str) -> list[QueryMatch]:
    """Query a parsed document with a small Markitect selector."""
    parsed = _parse_selector(selector)
    if parsed.target in {"document", "$", "."}:
        return [QueryMatch(kind="document", path="$", value=document.to_dict())]
    if parsed.target == "frontmatter":
        return _query_mapping(document.frontmatter, parsed.path, "frontmatter", "$.frontmatter")
    if parsed.target == "headings":
        return _query_headings(document.headings, parsed.filters)
    if parsed.target == "sections":
        return _query_sections(document.sections, parsed.filters)
    if parsed.target == "blocks":
        return _query_blocks(document.blocks, parsed.filters)
    if parsed.target == "metrics":
        return _query_mapping(collect_metrics(document).to_dict(), parsed.path, "metrics", "$.metrics")
    raise InvalidQueryError(f"Unsupported selector target `{parsed.target}`")


def extract_document(document: Document, selector: str) -> list[str]:
    """Extract text content from query matches."""
    extracted: list[str] = []
    for match in query_document(document, selector):
        if match.text is not None:
            extracted.append(match.text)
        elif isinstance(match.value, str):
            extracted.append(match.value)
        elif isinstance(match.value, int | float | bool):
            extracted.append(str(match.value))
    return extracted


def _parse_selector(selector: str) -> _Selector:
    raw = selector.strip()
    if not raw:
        raise InvalidQueryError("Selector cannot be empty")
    filters: dict[str, str] = {}
    base = raw
    if "[" in raw or "]" in raw:
        if not raw.endswith("]") or raw.count("[") != 1 or raw.count("]") != 1:
            raise InvalidQueryError(f"Invalid selector filter syntax `{selector}`")
        base, raw_filter = raw[:-1].split("[", 1)
        filters = _parse_filters(raw_filter)
    parts = [part for part in base.split(".") if part]
    if not parts:
        return _Selector(target="document", path=[], filters=filters)
    return _Selector(target=parts[0], path=parts[1:], filters=filters)


def _parse_filters(raw_filter: str) -> dict[str, str]:
    filters: dict[str, str] = {}
    for raw_part in raw_filter.split(","):
        part = raw_part.strip()
        if not part:
            continue
        operator = "~=" if "~=" in part else "="
        if operator not in part:
            raise InvalidQueryError(f"Invalid filter `{part}`")
        key, value = part.split(operator, 1)
        key = key.strip()
        if operator == "~=":
            key = f"{key}~"
        if not key:
            raise InvalidQueryError(f"Invalid filter `{part}`")
        filters[key] = _strip_quotes(value.strip())
    return filters


def _query_mapping(
    mapping: dict[str, Any],
    path: list[str],
    kind: str,
    root_path: str,
) -> list[QueryMatch]:
    if not path:
        return [QueryMatch(kind=kind, path=root_path, value=mapping)]
    value: Any = mapping
    current_path = root_path
    for part in path:
        current_path = f"{current_path}.{part}"
        if isinstance(value, dict) and part in value:
            value = value[part]
        else:
            return []
    return [QueryMatch(kind=kind, path=current_path, value=value, text=_text_value(value))]


def _query_headings(headings: list[Heading], filters: dict[str, str]) -> list[QueryMatch]:
    matches: list[QueryMatch] = []
    for index, heading in enumerate(headings):
        if not _match_heading(heading, filters):
            continue
        matches.append(
            QueryMatch(
                kind="heading",
                path=f"$.headings[{index}]",
                value=heading.to_dict(),
                text=f"{'#' * heading.level} {heading.text}",
                line=heading.line,
            )
        )
    return matches


def _query_sections(sections: list[Section], filters: dict[str, str]) -> list[QueryMatch]:
    matches: list[QueryMatch] = []
    for index, section in enumerate(sections):
        if not _match_section(section, filters):
            continue
        matches.append(
            QueryMatch(
                kind="section",
                path=f"$.sections[{index}]",
                value=section.to_dict(),
                text=_section_markdown(section),
                line=section.heading.line,
            )
        )
    return matches


def _query_blocks(blocks: list[ContentBlock], filters: dict[str, str]) -> list[QueryMatch]:
    matches: list[QueryMatch] = []
    for index, block in enumerate(blocks):
        if not _match_block(block, filters):
            continue
        matches.append(
            QueryMatch(
                kind="block",
                path=f"$.blocks[{index}]",
                value=block.to_dict(),
                text=block.text,
                line=block.line_start,
            )
        )
    return matches


def _match_heading(heading: Heading, filters: dict[str, str]) -> bool:
    for key, expected in filters.items():
        if key == "level" and str(heading.level) != expected:
            return False
        if key in {"text", "heading", "title"} and heading.text != expected:
            return False
        if key in {"text~", "heading~", "title~"} and expected.lower() not in heading.text.lower():
            return False
    return True


def _match_section(section: Section, filters: dict[str, str]) -> bool:
    section_text = "\n".join(block.text for block in section.blocks if block.text)
    for key, expected in filters.items():
        if key == "level" and str(section.heading.level) != expected:
            return False
        if key in {"heading", "title", "text"} and section.heading.text != expected:
            return False
        if key in {"heading~", "title~", "text~"} and expected.lower() not in section.heading.text.lower():
            return False
        if key == "contains" and expected not in section_text:
            return False
        if key == "contains~" and expected.lower() not in section_text.lower():
            return False
    return True


def _match_block(block: ContentBlock, filters: dict[str, str]) -> bool:
    for key, expected in filters.items():
        if key == "type" and block.type != expected:
            return False
        if key == "contains" and expected not in block.text:
            return False
        if key == "contains~" and expected.lower() not in block.text.lower():
            return False
    return True


def _section_markdown(section: Section) -> str:
    lines = [f"{'#' * section.heading.level} {section.heading.text}"]
    for block in section.blocks:
        if block.text:
            lines.extend(["", block.text])
    return "\n".join(lines).strip()


def _strip_quotes(value: str) -> str:
    if len(value) >= 2 and value[0] == value[-1] and value[0] in {'"', "'"}:
        return value[1:-1]
    return value


def _text_value(value: Any) -> str | None:
    if isinstance(value, str):
        return value
    if isinstance(value, int | float | bool):
        return str(value)
    return None


@@ -0,0 +1,148 @@
from pathlib import Path

import pytest
from click.testing import CliRunner

from markitect_tool.cli import main
from markitect_tool.core import parse_markdown
from markitect_tool.query import InvalidQueryError, extract_document, query_document

QUERY_DOC = """---
document_type: adr
status: accepted
nested:
  owner: Platform
---

# Use Query Selectors

## Context

The problem is that authors need predictable extraction from Markdown.

## Decision

We will use a small selector language before adopting a larger query backend.

## Consequences

- Queries remain readable.
- Extraction can feed later transforms.
"""


def test_query_frontmatter_path():
    document = parse_markdown(QUERY_DOC)
    matches = query_document(document, "frontmatter.nested.owner")
    assert len(matches) == 1
    assert matches[0].kind == "frontmatter"
    assert matches[0].path == "$.frontmatter.nested.owner"
    assert matches[0].text == "Platform"


def test_query_headings_by_level():
    document = parse_markdown(QUERY_DOC)
    matches = query_document(document, "headings[level=2]")
    assert [match.value["text"] for match in matches] == [
        "Context",
        "Decision",
        "Consequences",
    ]


def test_query_sections_by_exact_heading():
    document = parse_markdown(QUERY_DOC)
    matches = query_document(document, "sections[heading=Decision]")
    assert len(matches) == 1
    assert matches[0].kind == "section"
    assert matches[0].line == 14
    assert matches[0].text.startswith("## Decision")
    assert "small selector language" in matches[0].text


def test_query_sections_by_case_insensitive_contains():
    document = parse_markdown(QUERY_DOC)
    matches = query_document(document, "sections[contains~=TRANSFORMS]")
    assert [match.value["heading"]["text"] for match in matches] == ["Consequences"]


def test_query_blocks_by_type():
    document = parse_markdown(QUERY_DOC)
    matches = query_document(document, "blocks[type=bullet_list]")
    assert len(matches) == 1
    assert "Queries remain readable" in matches[0].text


def test_query_metrics_path():
    document = parse_markdown(QUERY_DOC)
    matches = query_document(document, "metrics.document.sections")
    assert matches[0].value == 4
    assert matches[0].text == "4"


def test_extract_document_returns_textual_matches():
    document = parse_markdown(QUERY_DOC)
    extracted = extract_document(document, "sections[heading=Context]")
    assert extracted == [
        "## Context\n\nThe problem is that authors need predictable extraction from Markdown."
    ]


def test_invalid_query_reports_error():
    document = parse_markdown(QUERY_DOC)
    with pytest.raises(InvalidQueryError):
        query_document(document, "sections[heading")


def test_mkt_query_outputs_json(tmp_path: Path):
    source = tmp_path / "doc.md"
    source.write_text(QUERY_DOC, encoding="utf-8")
    result = CliRunner().invoke(
        main, ["query", str(source), "sections[heading=Decision]"]
    )
    assert result.exit_code == 0
    assert '"count": 1' in result.output
    assert "Decision" in result.output


def test_mkt_query_outputs_text(tmp_path: Path):
    source = tmp_path / "doc.md"
    source.write_text(QUERY_DOC, encoding="utf-8")
    result = CliRunner().invoke(
        main, ["query", str(source), "headings[level=2]", "--format", "text"]
    )
    assert result.exit_code == 0
    assert "3 match(es)" in result.output
    assert "## Context" in result.output


def test_mkt_extract_outputs_text(tmp_path: Path):
    source = tmp_path / "doc.md"
    source.write_text(QUERY_DOC, encoding="utf-8")
    result = CliRunner().invoke(
        main, ["extract", str(source), "frontmatter.status"]
    )
    assert result.exit_code == 0
    assert result.output.strip() == "accepted"


@@ -6,6 +6,9 @@ domain: markitect
status: done
owner: markitect-tool
topic_slug: markitect
planning_priority: complete
planning_order: 10
depends_on_workplans: []
created: "2026-05-03"
updated: "2026-05-03"
state_hub_workstream_id: "4d405d74-faec-440e-873e-692ff9ca96e7"


@@ -6,6 +6,10 @@ domain: markitect
status: done
owner: markitect-tool
topic_slug: markitect
planning_priority: complete
planning_order: 20
depends_on_workplans:
- MKTT-WP-0001
created: "2026-05-03"
updated: "2026-05-03"
state_hub_workstream_id: "0fe54d2c-d579-4b03-a647-7a15bb835893"


@@ -6,6 +6,12 @@ domain: markitect
status: active
owner: markitect-tool
topic_slug: markitect
planning_priority: P0
planning_order: 40
depends_on_workplans:
- MKTT-WP-0001
- MKTT-WP-0002
- MKTT-WP-0004
created: "2026-05-03"
updated: "2026-05-03"
state_hub_workstream_id: "9fefb57d-985e-4125-8daf-03554844f45e"
@@ -67,7 +73,7 @@ validation, structured violations, `mkt validate`, and `mkt schema validate`.
```task
id: MKTT-WP-0003-T004
-status: todo
+status: done
priority: medium
state_hub_task_id: "e4f72218-601e-488f-a5df-171b91a747d2"
```
@@ -75,6 +81,10 @@ state_hub_task_id: "e4f72218-601e-488f-a5df-171b91a747d2"
Implement FR-030 and FR-031 over the structured representation. Start with a
small query language or JSONPath-like adapter only if it remains simple.

The initial implementation is complete for simple selectors over frontmatter,
headings, sections, blocks, and metrics, with API access plus `mkt query` and
`mkt extract`.
## P3.5 - Implement transform, compose, and include primitives
```task


@@ -6,6 +6,11 @@ domain: markitect
status: done
owner: markitect-tool
topic_slug: markitect
planning_priority: complete
planning_order: 30
depends_on_workplans:
- MKTT-WP-0001
- MKTT-WP-0002
created: "2026-05-03"
updated: "2026-05-03"
state_hub_workstream_id: "558787e1-d287-46a5-9214-634e8b90a858"


@@ -6,6 +6,11 @@ domain: markitect
status: todo
owner: markitect-tool
topic_slug: markitect
planning_priority: P2
planning_order: 70
depends_on_workplans:
- MKTT-WP-0003
- MKTT-WP-0004
created: "2026-05-03"
updated: "2026-05-03"
state_hub_workstream_id: "7918687e-2364-46b1-ab7e-65aa77cb8449"


@@ -0,0 +1,133 @@
---
id: MKTT-WP-0006
type: workplan
title: "Optional Cache Backend Architecture Core"
domain: markitect
status: todo
owner: markitect-tool
topic_slug: markitect
planning_priority: P1
planning_order: 50
depends_on_workplans:
- MKTT-WP-0004
depends_on_tasks:
- MKTT-WP-0003-T005
created: "2026-05-03"
updated: "2026-05-03"
state_hub_workstream_id: "0c585f8a-5c7e-4c89-b785-5b0089180256"
---
# MKTT-WP-0006: Optional Cache Backend Architecture Core
## Purpose
Create the optional backend fabric that lets `markitect-tool` attach cached
ASTs, indexes, query adapters, context packages, and policy gateways without
making persistent infrastructure mandatory for core CLI use.
## Background
Research and architecture are captured in:
- `docs/research-lab-cache-backend-research.md`
- `docs/cache-backend-architecture-blueprint.md`
## Decision
Do not start this workplan until the current deterministic
transform/composition slice has enough shape to show what cache invalidation
must preserve. Start it before WP-0003 P3.7 caching becomes implementation
work.
## P6.1 - Define backend capability model
```task
id: MKTT-WP-0006-T001
status: todo
priority: high
state_hub_task_id: "8c04f146-942c-45b8-9a7b-3bd61916aa4b"
```
Define capability names, backend manifests, and compatibility checks for:
- snapshots
- JSON/AST query
- full-text search
- SQL
- vector/hybrid search
- context packages
- policy enforcement
- provenance
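
A minimal sketch of a capability check, assuming a manifest reduces to a backend id plus a set of declared capabilities. The names `Capability`, `BackendManifest`, and `missing_capabilities`, and the capability strings themselves, are hypothetical and not part of `markitect-tool`.

```python
from dataclasses import dataclass
from enum import Enum


class Capability(str, Enum):
    """Hypothetical capability names mirroring the list above."""

    SNAPSHOTS = "snapshots"
    JSON_AST_QUERY = "json_ast_query"
    FULL_TEXT_SEARCH = "full_text_search"
    SQL = "sql"
    VECTOR_SEARCH = "vector_search"
    CONTEXT_PACKAGES = "context_packages"
    POLICY_ENFORCEMENT = "policy_enforcement"
    PROVENANCE = "provenance"


@dataclass(frozen=True)
class BackendManifest:
    backend_id: str
    capabilities: frozenset


def missing_capabilities(manifest, required):
    """Return the required capabilities the backend does not declare."""
    return set(required) - manifest.capabilities


local = BackendManifest(
    backend_id="local-sqlite",  # illustrative id only
    capabilities=frozenset({Capability.SNAPSHOTS, Capability.SQL}),
)
gaps = missing_capabilities(local, {Capability.SQL, Capability.FULL_TEXT_SEARCH})
```

An empty result means the backend is compatible with the caller's requirements; a non-empty one can feed a diagnostic.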
## P6.2 - Define snapshot model and content identity
```task
id: MKTT-WP-0006-T002
status: todo
priority: high
state_hub_task_id: "5debc135-908a-47ed-ba15-564610970e38"
```
Specify content-addressed document snapshots keyed by source content hash,
parser version, parse options, and contract version where relevant.
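
One possible way to derive such a key, sketched under the assumption that parse options are JSON-serializable. The function name `snapshot_id` and the field names in the identity record are illustrative, not a committed format.

```python
import hashlib
import json


def snapshot_id(content, parser_version, parse_options, contract_version=None):
    """Derive a deterministic id from everything that affects the parse result."""
    identity = {
        "content_sha256": hashlib.sha256(content.encode("utf-8")).hexdigest(),
        "parser_version": parser_version,
        "parse_options": parse_options,
        "contract_version": contract_version,
    }
    # sort_keys makes the serialization independent of dict insertion order.
    canonical = json.dumps(identity, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = snapshot_id("# Title\n", "1.0.0", {"frontmatter": True})
b = snapshot_id("# Title\n", "1.0.0", {"frontmatter": True})
c = snapshot_id("# Title\n", "1.1.0", {"frontmatter": True})
```

Identical inputs yield the same id, while a parser upgrade alone is enough to produce a new one, which is what cache invalidation needs.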
## P6.3 - Define backend interfaces
```task
id: MKTT-WP-0006-T003
status: todo
priority: high
state_hub_task_id: "a3e37112-1197-4f6f-8de8-7b3067ef060e"
```
Add protocol classes for snapshot backends, index backends, query adapters,
context package registries, and access policy gateways.
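
A rough sketch of what one such protocol class might look like, using `typing.Protocol`. The method names follow the blueprint's minimum snapshot protocol (`diff_snapshot` is left out for brevity); the in-memory backend and its id scheme are placeholders for illustration only.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class SnapshotBackend(Protocol):
    """Structural interface: any class with these methods qualifies."""

    def put_document(self, source_path: str, content: str, parse_options: dict) -> str: ...

    def get_snapshot(self, snapshot_id: str) -> Any: ...

    def resolve_source(self, source_path: str) -> str: ...


class InMemorySnapshotBackend:
    """Toy backend; it stores raw content instead of a parsed document."""

    def __init__(self):
        self._snapshots = {}
        self._latest = {}

    def put_document(self, source_path, content, parse_options):
        snapshot_id = f"snap-{len(self._snapshots) + 1}"  # placeholder id scheme
        self._snapshots[snapshot_id] = {
            "source_path": source_path,
            "content": content,
            "parse_options": parse_options,
        }
        self._latest[source_path] = snapshot_id
        return snapshot_id

    def get_snapshot(self, snapshot_id):
        return self._snapshots[snapshot_id]

    def resolve_source(self, source_path):
        return self._latest[source_path]


backend = InMemorySnapshotBackend()
first = backend.put_document("docs/adr.md", "# ADR\n", {})
```

Because the protocol is structural, backends do not need to inherit from a core base class, which keeps optional packages decoupled from the core.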
## P6.4 - Implement local backend registry
```task
id: MKTT-WP-0006-T004
status: todo
priority: medium
state_hub_task_id: "6c9b8765-4d14-436d-a2c9-c028a31aaade"
```
Load backend manifests from project config and expose registered capabilities
without importing optional dependencies unless needed.
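
The lazy-loading requirement could be met roughly as follows. This is a sketch that assumes manifests resolve to `module:attribute` entry-point strings; `BackendRegistry` and its methods are hypothetical names.

```python
import importlib


class BackendRegistry:
    """Maps backend ids to 'module:attribute' entry points and imports them lazily."""

    def __init__(self):
        self._entries = {}
        self._loaded = {}

    def register(self, backend_id, entry_point, capabilities):
        # Only strings are stored here, so registration never triggers an import.
        self._entries[backend_id] = (entry_point, tuple(capabilities))

    def capabilities(self, backend_id):
        return self._entries[backend_id][1]

    def load(self, backend_id):
        if backend_id not in self._loaded:
            module_name, _, attribute = self._entries[backend_id][0].partition(":")
            module = importlib.import_module(module_name)
            self._loaded[backend_id] = getattr(module, attribute)
        return self._loaded[backend_id]


registry = BackendRegistry()
# The standard-library sqlite3 module stands in for an optional dependency.
registry.register("local-sqlite", "sqlite3:connect", ["snapshots", "sql"])
connect = registry.load("local-sqlite")
```

Listing capabilities only reads the registered strings, so a command such as `mkt backend list` would not need the optional packages installed.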
## P6.5 - Add provenance envelope
```task
id: MKTT-WP-0006-T005
status: todo
priority: medium
state_hub_task_id: "7b551eae-99c8-4c8a-b781-18d59d318707"
```
Add provenance metadata shared by snapshots, query results, context packages,
and diagnostics.
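
As an illustration, a shared envelope might wrap any payload together with the same provenance record. The field names below are assumptions chosen for the sketch, not a settled schema.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Provenance:
    source_path: str
    content_hash: str
    parser_version: str
    produced_by: str  # e.g. the command or adapter that created the artifact
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def with_provenance(payload, provenance):
    """Wrap any result payload in the same envelope shape."""
    return {"data": payload, "provenance": asdict(provenance)}


envelope = with_provenance(
    {"count": 1},
    Provenance("docs/adr.md", "sha256:placeholder", "1.0.0", "mkt query"),
)
```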
## P6.6 - Add CLI scaffolding
```task
id: MKTT-WP-0006-T006
status: todo
priority: medium
state_hub_task_id: "921e589c-8b0d-4eeb-8834-4a4c6c73da65"
```
Add read-only commands:
```text
mkt backend list
mkt backend inspect <id>
mkt cache status
```
No persistent write behavior is required in this task.
## Exit Criteria
- Core CLI still works without any backend.
- Backends can declare capabilities in Markdown/YAML manifests.
- Query and future cache work can target backend interfaces.
- Provenance is represented consistently.


@@ -0,0 +1,125 @@
---
id: MKTT-WP-0007
type: workplan
title: "Advanced Query and Local Index Backend"
domain: markitect
status: todo
owner: markitect-tool
topic_slug: markitect
planning_priority: P2
planning_order: 60
depends_on_workplans:
- MKTT-WP-0006
created: "2026-05-03"
updated: "2026-05-03"
state_hub_workstream_id: "d61a82e4-651a-4df2-944a-9ff996b2e1f6"
---
# MKTT-WP-0007: Advanced Query and Local Index Backend
## Purpose
Implement the first practical backend use case: cached AST introspection,
JSONPath querying, SQLite metadata, and FTS5 search over Markdown documents.
## P7.1 - Implement local snapshot store
```task
id: MKTT-WP-0007-T001
status: todo
priority: high
state_hub_task_id: "8894a9a4-586c-457b-b4e6-add8276ff5f2"
```
Persist parsed document snapshots and source metadata in a local cache
directory.
## P7.2 - Add AST introspection commands
```task
id: MKTT-WP-0007-T002
status: todo
priority: high
state_hub_task_id: "fb9eaa9d-5c20-49a9-a7a6-acae28ac5e20"
```
Add:
```text
mkt ast show <file>
mkt ast stats <file>
```
Use the current parsed document and token model. Do not require cache presence
for single-file use.
## P7.3 - Add optional JSONPath query adapter
```task
id: MKTT-WP-0007-T003
status: todo
priority: high
state_hub_task_id: "a7b46b32-f322-4fe0-a6fb-60b0b823593c"
```
Support JSONPath over `Document.to_dict()` behind an optional dependency and
shared query result envelope.
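
A sketch of the optional-dependency guard, assuming the third-party `jsonpath-ng` package is the chosen JSONPath library (the workplan does not name one). The helper names and the result shape are hypothetical.

```python
def load_jsonpath_parser():
    """Import the JSONPath parser only when a JSONPath query is requested."""
    try:
        from jsonpath_ng import parse  # optional dependency (assumed choice)
    except ImportError as exc:
        raise RuntimeError(
            "JSONPath queries need the optional 'jsonpath-ng' package"
        ) from exc
    return parse


def jsonpath_query(data, expression):
    """Run a JSONPath expression over a plain dict such as Document.to_dict()."""
    parse = load_jsonpath_parser()
    return [
        {"path": str(match.full_path), "value": match.value}
        for match in parse(expression).find(data)
    ]
```

Core selectors never call `load_jsonpath_parser`, so a missing package only surfaces when a JSONPath query is actually attempted.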
## P7.4 - Build SQLite metadata and JSON index
```task
id: MKTT-WP-0007-T004
status: todo
priority: medium
state_hub_task_id: "479f11a3-4ab4-451b-991c-7f143f2bffea"
```
Persist source files, content hashes, frontmatter, headings, sections, blocks,
and metrics in SQLite.
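
A possible starting schema, shown with the standard-library `sqlite3` module. Table and column names are illustrative assumptions, and only sources and sections are covered here.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS sources (
    path TEXT PRIMARY KEY,
    content_hash TEXT NOT NULL,
    frontmatter_json TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS sections (
    id INTEGER PRIMARY KEY,
    source_path TEXT NOT NULL REFERENCES sources(path),
    heading TEXT NOT NULL,
    level INTEGER NOT NULL,
    line INTEGER NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO sources VALUES (?, ?, ?)",
    ("docs/adr.md", "sha256:placeholder", json.dumps({"status": "accepted"})),
)
conn.execute(
    "INSERT INTO sections (source_path, heading, level, line) VALUES (?, ?, ?, ?)",
    ("docs/adr.md", "Decision", 2, 14),
)
# Join sections back to their source so every hit carries its origin.
rows = conn.execute(
    "SELECT s.heading, s.line FROM sections AS s"
    " JOIN sources AS src ON src.path = s.source_path"
    " WHERE s.level = 2"
).fetchall()
```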
## P7.5 - Add FTS5 section/block search
```task
id: MKTT-WP-0007-T005
status: todo
priority: medium
state_hub_task_id: "0f03e9be-b6f0-4e4b-8220-3bbf638a892b"
```
Add full-text search over section and block text with source spans and
relevance ranking.
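
A small sketch of the idea with SQLite's FTS5 extension, assuming the Python build's SQLite was compiled with FTS5 (common, but not guaranteed). The table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# source_path and line are stored but not tokenized, so they act as a source span.
conn.execute(
    "CREATE VIRTUAL TABLE section_fts USING fts5("
    "text, source_path UNINDEXED, line UNINDEXED)"
)
conn.executemany(
    "INSERT INTO section_fts (text, source_path, line) VALUES (?, ?, ?)",
    [
        ("Authors need predictable extraction from Markdown.", "docs/adr.md", 10),
        ("We will use a small selector language.", "docs/adr.md", 14),
    ],
)
# bm25() returns lower values for better matches, so ascending order ranks best first.
hits = conn.execute(
    "SELECT source_path, line FROM section_fts"
    " WHERE section_fts MATCH ? ORDER BY bm25(section_fts)",
    ("selector",),
).fetchall()
```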
## P7.6 - Add incremental refresh
```task
id: MKTT-WP-0007-T006
status: todo
priority: medium
state_hub_task_id: "7d9472e6-0716-435b-866c-d2c66ad786cf"
```
Refresh only changed files based on content hash and parser version.
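
The staleness check could be as simple as the following sketch, which assumes the index maps each path to a `(content_hash, parser_version)` pair. `PARSER_VERSION` and `stale_paths` are illustrative names.

```python
import hashlib

PARSER_VERSION = "1.0.0"  # hypothetical; would come from the parser package


def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def stale_paths(current_files, index, parser_version=PARSER_VERSION):
    """Return paths whose cached entry is missing or no longer matches."""
    stale = []
    for path, text in sorted(current_files.items()):
        fresh = (content_hash(text), parser_version)
        if index.get(path) != fresh:
            stale.append(path)
    return stale


index = {
    "a.md": (content_hash("# A\n"), "1.0.0"),  # unchanged
    "b.md": (content_hash("# B\n"), "0.9.0"),  # parsed by an older parser
}
files = {"a.md": "# A\n", "b.md": "# B\n", "c.md": "# C\n"}
to_refresh = stale_paths(files, index)
```

Here `b.md` is refreshed because of the parser version alone, and `c.md` because it has never been indexed.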
## P7.7 - Add local index CLI
```task
id: MKTT-WP-0007-T007
status: todo
priority: high
state_hub_task_id: "35cc63ff-3723-43d5-aaf6-f9312efa0f4b"
```
Add:
```text
mkt cache init
mkt cache build <path>
mkt cache query <selector-or-query>
mkt search <text>
```
## Exit Criteria
- Legacy AST/JSONPath capabilities are recovered as an optional backend.
- Repeated local queries are faster than uncached runs and are explainable.
- Simple selectors still work without cache.


@@ -0,0 +1,109 @@
---
id: MKTT-WP-0008
type: workplan
title: "Agent Working Memory Context Cache"
domain: markitect
status: todo
owner: markitect-tool
topic_slug: markitect
planning_priority: P3
planning_order: 90
depends_on_workplans:
- MKTT-WP-0006
- MKTT-WP-0007
- MKTT-WP-0009
created: "2026-05-03"
updated: "2026-05-03"
state_hub_workstream_id: "6269f338-4f5c-40ee-90e5-0371f5c3874c"
---
# MKTT-WP-0008: Agent Working Memory Context Cache
## Purpose
Create activatable context packages that let agents drop, reactivate, and
reuse project knowledge efficiently while preserving provenance and policy
metadata.
## P8.1 - Define context package schema
```task
id: MKTT-WP-0008-T001
status: todo
priority: high
state_hub_task_id: "21ee9c37-4add-4886-bd03-a7bb4b20e957"
```
Represent source spans, summaries, token estimates, freshness, provenance,
policy labels, and retrieval recipes.
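
A partial sketch of such a schema as dataclasses, covering source spans, summary, token estimate, policy labels, and retrieval recipe (freshness and provenance are omitted for brevity). All class and field names are assumptions, and the token estimate is a crude character-count heuristic rather than a real tokenizer.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SourceSpan:
    path: str
    start_line: int
    end_line: int


@dataclass
class ContextPackage:
    package_id: str
    spans: list
    summary: str
    policy_labels: set = field(default_factory=set)
    retrieval_recipe: str = ""  # e.g. the selector that produced the spans

    def token_estimate(self, chars_per_token=4):
        # Crude heuristic for illustration, not a real tokenizer.
        return max(1, len(self.summary) // chars_per_token)


package = ContextPackage(
    package_id="ctx-adr-decision",
    spans=[SourceSpan("docs/adr.md", 14, 16)],
    summary="Use a small selector language before a larger query backend.",
    retrieval_recipe="sections[heading=Decision]",
)
```

Keeping the retrieval recipe on the package is what would let a later refresh re-run the same selector against newer snapshots.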
## P8.2 - Implement package creation from queries
```task
id: MKTT-WP-0008-T002
status: todo
priority: high
state_hub_task_id: "4df06b93-13ce-41fb-a8c3-f04d4ad9d752"
```
Create context packages from simple selectors, cached search results, or
manifest files.
## P8.3 - Implement activation lifecycle
```task
id: MKTT-WP-0008-T003
status: todo
priority: medium
state_hub_task_id: "9f3d9792-d655-482d-bae0-262df5fc0136"
```
Support activate, deactivate, refresh, and explain operations for a package.
## P8.4 - Add memory namespaces
```task
id: MKTT-WP-0008-T004
status: todo
priority: medium
state_hub_task_id: "2d090494-0e10-44cd-8e2d-c418d7530b27"
```
Support project, user, agent, thread, and task namespaces without hard-coding
any external agent platform.
## P8.5 - Add summary layers
```task
id: MKTT-WP-0008-T005
status: todo
priority: medium
state_hub_task_id: "4d1cf970-3d6d-4bd5-8da9-ec2399aa7efe"
```
Support deterministic summaries first, then optional LLM-generated summaries
through an injected adapter.
## P8.6 - Add CLI commands
```task
id: MKTT-WP-0008-T006
status: todo
priority: medium
state_hub_task_id: "2f18386c-9d2c-4af1-b8e2-75cb487c1692"
```
Add:
```text
mkt context pack <manifest-or-query>
mkt context activate <package-id>
mkt context explain <package-id>
mkt context refresh <package-id>
```
## Exit Criteria
- Agents can reactivate project context by stable id.
- Context packages show included sources and token budgets.
- Memory writes remain explicit and inspectable.


@@ -0,0 +1,105 @@
---
id: MKTT-WP-0009
type: workplan
title: "Access-Controlled Knowledge Gateway"
domain: markitect
status: todo
owner: markitect-tool
topic_slug: markitect
planning_priority: P2
planning_order: 80
depends_on_workplans:
- MKTT-WP-0006
created: "2026-05-03"
updated: "2026-05-03"
state_hub_workstream_id: "f36acbc9-881d-46f2-9181-67de228df0c2"
---
# MKTT-WP-0009: Access-Controlled Knowledge Gateway
## Purpose
Add a policy boundary for cached retrieval and context packages so Markitect can
support security-sensitive knowledge systems and agent workflows.
## P9.1 - Define access-control ladder
```task
id: MKTT-WP-0009-T001
status: todo
priority: high
state_hub_task_id: "acf240b4-7210-4ee5-90b6-2f2fe1438439"
```
Specify supported modes:
- labels and trust zones
- path/file ACLs
- relationship-based access control
- attribute/rule-based policies
- external policy engines
## P9.2 - Implement local label policy
```task
id: MKTT-WP-0009-T002
status: todo
priority: high
state_hub_task_id: "9eb589d2-82f2-4282-9af0-3958826d397d"
```
Start with local policy labels and diagnostics for denied or redacted results.
## P9.3 - Add policy-aware query filtering
```task
id: MKTT-WP-0009-T003
status: todo
priority: high
state_hub_task_id: "d78ab623-c472-4b24-ad84-08464b574886"
```
Ensure results are filtered before leaving the backend boundary. Result
metadata must report whether policy filtering occurred.
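
A minimal sketch of label-based filtering that also reports whether filtering occurred. It assumes each result carries a `labels` list and that a caller may see a result only if cleared for all of its labels; the function name and response shape are hypothetical.

```python
def filter_by_labels(results, allowed_labels):
    """Drop results carrying any label the caller is not cleared for."""
    allowed = set(allowed_labels)
    visible = [r for r in results if set(r["labels"]) <= allowed]
    denied = len(results) - len(visible)
    return {
        "matches": visible,
        # Metadata tells the caller that something was withheld, not what.
        "policy": {"filtered": denied > 0, "denied_count": denied},
    }


results = [
    {"path": "docs/adr.md", "labels": ["public"]},
    {"path": "docs/incident.md", "labels": ["public", "restricted"]},
]
response = filter_by_labels(results, allowed_labels={"public"})
```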
## P9.4 - Add relationship policy adapter design
```task
id: MKTT-WP-0009-T004
status: todo
priority: medium
state_hub_task_id: "bd4c2b7a-6eac-4845-b5c8-9f9c64946f0c"
```
Design an adapter boundary for Zanzibar/OpenFGA/SpiceDB-style relationship
checks without binding the core package to any one service.
## P9.5 - Add rule policy adapter design
```task
id: MKTT-WP-0009-T005
status: todo
priority: medium
state_hub_task_id: "752f1962-e83c-44cc-a1c1-0f89a4ea2a90"
```
Design an adapter boundary for OPA/Rego and Cedar-style rule policies.
## P9.6 - Add decision logs and explainability
```task
id: MKTT-WP-0009-T006
status: todo
priority: medium
state_hub_task_id: "990f01fa-5008-4871-a887-1c6ab4375605"
```
Record policy decisions with subject, action, object, context, decision,
reason, and provenance.
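
One way such a record could be shaped, sketched as a frozen dataclass serialized to a JSON line. The class name, the `obj` field (standing for the decision's object, renamed to avoid shadowing the Python builtin), and the example values are assumptions.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class PolicyDecision:
    subject: str
    action: str
    obj: str  # the "object" of the decision
    decision: str  # "allow" or "deny"
    reason: str
    context: dict = field(default_factory=dict)
    provenance: dict = field(default_factory=dict)


def to_log_line(decision):
    """Serialize one decision as a JSON line for an append-only log."""
    return json.dumps(asdict(decision), sort_keys=True)


line = to_log_line(
    PolicyDecision(
        subject="agent:planner",
        action="read",
        obj="docs/incident.md",
        decision="deny",
        reason="missing label: restricted",
    )
)
```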
## Exit Criteria
- Local caches can operate in an explicit policy mode.
- Query and context package results are policy-aware.
- Stricter authorization engines can attach later without replacing the
query/cache framework.