Refresh planning layer for backend fabric

2026-05-04 03:25:26 +02:00
parent 3f08a27a24
commit b1577d90db
10 changed files with 797 additions and 2 deletions

@@ -75,6 +75,37 @@ The resulting `snapshot_id` is a stable hash over those identity fields. This
lets future AST, JSONPath, FTS, SQL, vector, policy, and context-package
backends invalidate derived data without guessing what changed.
## Refresh Planning
Before WP-0007 writes a local SQLite index, the backend fabric provides a
read-only refresh planner. The planner compares current Markdown files with a
portable snapshot-state inventory and reports:
- unchanged files
- files that need hashing
- files that need parsing
- files that need indexing
- files that only need metadata updates
- deleted sources
- dependency-invalidated dependents
The planner uses a cheap-first strategy:
1. Compare path, size, mtime, parser version, parse options hash, and contract
hash.
2. If cheap metadata is unchanged, skip hashing, parsing, and indexing.
3. If metadata changed, either mark the file for hash/parse/index or, with
`--verify-hashes`, hash only those changed candidates to avoid parsing when
content is unchanged.
4. Use dependency edges to invalidate direct and transitive dependents.
This gives WP-0007 a performance contract before the storage engine exists.
```bash
mkt backend refresh-plan docs --state examples/backend-state/snapshot-state.yaml
mkt backend refresh-plan docs --state .markitect/cache/snapshots.yaml --verify-hashes
```
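The cheap-first steps above can be sketched as a single decision function. This is a minimal, self-contained illustration and not the toolkit's implementation: `KnownState`, `hash_file`, `decide_actions`, and `demo` are hypothetical names, and the identity comparison (parser version, parse options hash, contract hash) is left out for brevity.

```python
from __future__ import annotations

import hashlib
import os
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class KnownState:
    """Hypothetical stand-in for one previously recorded source state."""

    size: int
    mtime_ns: int
    content_hash: str


def hash_file(path: Path) -> str:
    return "sha256:" + hashlib.sha256(path.read_bytes()).hexdigest()


def decide_actions(path: Path, known: KnownState | None, verify_hashes: bool = False) -> list[str]:
    """Return refresh actions for one file, checking cheap metadata first."""
    if known is None:
        return ["hash", "parse", "index"]  # never seen before
    stat = path.stat()
    if (known.size, known.mtime_ns) == (stat.st_size, stat.st_mtime_ns):
        return []  # cheap metadata unchanged: no hashing, parsing, or indexing
    if not verify_hashes:
        return ["hash", "parse", "index"]  # metadata changed: assume content changed
    if hash_file(path) == known.content_hash:
        return ["hash", "metadata"]  # file was only touched: refresh stored metadata
    return ["hash", "parse", "index"]


def demo() -> dict[str, list[str]]:
    with tempfile.TemporaryDirectory() as tmp:
        doc = Path(tmp) / "doc.md"
        doc.write_text("# Doc\n", encoding="utf-8")
        stat = doc.stat()
        known = KnownState(stat.st_size, stat.st_mtime_ns, hash_file(doc))
        results = {"unchanged": decide_actions(doc, known)}
        # Move mtime forward by one second without changing the content.
        os.utime(doc, ns=(stat.st_atime_ns, stat.st_mtime_ns + 1_000_000_000))
        results["touched"] = decide_actions(doc, known)
        results["touched_verified"] = decide_actions(doc, known, verify_hashes=True)
        return results
```

In this sketch a touched-but-identical file costs one hash with verification enabled, instead of a full parse and index pass.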
## Provenance Envelope
The shared backend provenance envelope records:
@@ -113,6 +144,7 @@ Read-only inspection commands:
mkt backend list --path examples/backends
mkt backend inspect local-sqlite-cache --path examples/backends --require snapshots --require provenance
mkt backend snapshot-id docs/content-references.md
mkt backend refresh-plan docs --state examples/backend-state/snapshot-state.yaml
```
The existing `mkt cache status` remains the lightweight file-manifest change

@@ -33,7 +33,7 @@ and descriptions mirror the operational view.
| `MKTT-WP-0003` | complete | done | `MKTT-WP-0001`, `MKTT-WP-0002`, `MKTT-WP-0004` | Core toolkit implementation is complete. |
| `MKTT-WP-0006` | complete | done | `MKTT-WP-0004`; task-level trigger: `MKTT-WP-0003-T005` | Optional backend fabric is complete: manifests, capabilities, snapshot identity, interfaces, registry, provenance, and read-only CLI scaffolding. |
| `MKTT-WP-0010` | complete | done | `MKTT-WP-0004`; task-level trigger: `MKTT-WP-0003-T006` | Content references, processors, explode/implode, weave/tangle, content classes, and migration examples are complete as the first WP-0010 extension layer. |
| `MKTT-WP-0007` | P2 | todo | `MKTT-WP-0006` | First practical cache backend use case: AST/JSONPath/SQLite/FTS. |
| `MKTT-WP-0007` | P2 | todo | `MKTT-WP-0006` | First practical cache backend use case: AST/JSONPath/SQLite/FTS. Preliminary refresh planning is in place as the performance contract. |
| `MKTT-WP-0005` | P2 | todo | `MKTT-WP-0003`, `MKTT-WP-0004` | Pick up when generation/form/context or semantic assessment pressure appears. |
| `MKTT-WP-0011` | P2 | todo | `MKTT-WP-0003`; task-level triggers: `MKTT-WP-0010-T001`, `MKTT-WP-0010-T005` | Declarative Markdown dataflow workflows: source extraction, deterministic/assisted processing, and multi-output generation. |
| `MKTT-WP-0009` | P2 | todo | `MKTT-WP-0006` | Establish access-control gateway before security-sensitive cache/context use. |

@@ -0,0 +1,11 @@
snapshots:
  - path: docs/content-references.md
    size: 0
    mtime_ns: 0
    content_hash: sha256:example
    snapshot_id: snapshot:example
    indexed: true
    dependencies:
      - source_id: snapshot:example
        target: examples/references/standard/clauses.md
        kind: reference

@@ -43,15 +43,21 @@ from markitect_tool.backend import (
ContextPackageRegistry,
DependencyEdge,
DocumentSnapshot,
EMPTY_PARSE_OPTIONS_HASH,
IndexBackend,
ProcessorResultStore,
ProvenanceEnvelope,
QueryAdapter,
SnapshotPlanEntry,
SnapshotRefreshPlan,
SnapshotBackend,
SnapshotIdentity,
SnapshotState,
capability_check,
load_backend_manifest,
load_backend_registry,
load_snapshot_state_file,
plan_snapshot_refresh,
snapshot_identity_for_file,
)
from markitect_tool.content_class import (
@@ -194,15 +200,21 @@ __all__ = [
"ContextPackageRegistry",
"DependencyEdge",
"DocumentSnapshot",
"EMPTY_PARSE_OPTIONS_HASH",
"IndexBackend",
"ProcessorResultStore",
"ProvenanceEnvelope",
"QueryAdapter",
"SnapshotPlanEntry",
"SnapshotRefreshPlan",
"SnapshotBackend",
"SnapshotIdentity",
"SnapshotState",
"capability_check",
"load_backend_manifest",
"load_backend_registry",
"load_snapshot_state_file",
"plan_snapshot_refresh",
"snapshot_identity_for_file",
"ClassCompositionResult",
"ContentClass",

@@ -9,6 +9,7 @@ from markitect_tool.backend.engine import (
BackendRegistryError,
DependencyEdge,
DocumentSnapshot,
EMPTY_PARSE_OPTIONS_HASH,
ProvenanceEnvelope,
SnapshotIdentity,
capability_check,
@@ -16,6 +17,13 @@ from markitect_tool.backend.engine import (
load_backend_registry,
snapshot_identity_for_file,
)
from markitect_tool.backend.planning import (
SnapshotPlanEntry,
SnapshotRefreshPlan,
SnapshotState,
load_snapshot_state_file,
plan_snapshot_refresh,
)
from markitect_tool.backend.interfaces import (
AccessPolicyGateway,
ContextPackageRegistry,
@@ -34,12 +42,18 @@ __all__ = [
"BackendRegistryError",
"DependencyEdge",
"DocumentSnapshot",
"EMPTY_PARSE_OPTIONS_HASH",
"ProvenanceEnvelope",
"SnapshotIdentity",
"capability_check",
"load_backend_manifest",
"load_backend_registry",
"snapshot_identity_for_file",
"SnapshotPlanEntry",
"SnapshotRefreshPlan",
"SnapshotState",
"load_snapshot_state_file",
"plan_snapshot_refresh",
"AccessPolicyGateway",
"ContextPackageRegistry",
"IndexBackend",

@@ -32,6 +32,7 @@ BACKEND_CAPABILITIES = {
DEFAULT_BACKEND_PATHS = (".markitect/backends", ".markitect/backend.yaml")
PARSER_ID = "markdown-it-py/commonmark"
PARSER_VERSION = "markitect-tool:1"
EMPTY_PARSE_OPTIONS_HASH = "sha256:44136fa355b3678a1146ad16f7e8649e94fb4fc21fe77e8310c060f61caaff8a"
class BackendRegistryError(ValueError):
@@ -103,7 +104,7 @@ class SnapshotIdentity:
content_hash: str
parser: str = PARSER_ID
parser_version: str = PARSER_VERSION
parse_options_hash: str = "sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
parse_options_hash: str = EMPTY_PARSE_OPTIONS_HASH
contract_hash: str | None = None
@property

@@ -0,0 +1,425 @@
"""Refresh planning for optional snapshot and index backends."""

from __future__ import annotations

import hashlib
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any

import yaml

from markitect_tool.backend.engine import (
    DependencyEdge,
    EMPTY_PARSE_OPTIONS_HASH,
    PARSER_ID,
    PARSER_VERSION,
)
from markitect_tool.cache import scan_markdown_files


@dataclass(frozen=True)
class SnapshotState:
    """Previously known source state from a snapshot/index backend."""

    path: str
    size: int
    mtime_ns: int
    content_hash: str
    snapshot_id: str
    parser: str = PARSER_ID
    parser_version: str = PARSER_VERSION
    parse_options_hash: str = EMPTY_PARSE_OPTIONS_HASH
    contract_hash: str | None = None
    indexed: bool = True
    dependencies: list[DependencyEdge] = field(default_factory=list)

    def to_dict(self) -> dict[str, Any]:
        data = asdict(self)
        data["dependencies"] = [edge.to_dict() for edge in self.dependencies]
        return {key: value for key, value in data.items() if value is not None}

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "SnapshotState":
        return cls(
            path=str(data["path"]),
            size=int(data["size"]),
            mtime_ns=int(data["mtime_ns"]),
            content_hash=str(data["content_hash"]),
            snapshot_id=str(data["snapshot_id"]),
            parser=str(data.get("parser", PARSER_ID)),
            parser_version=str(data.get("parser_version", PARSER_VERSION)),
            parse_options_hash=str(
                data.get(
                    "parse_options_hash",
                    EMPTY_PARSE_OPTIONS_HASH,
                )
            ),
            contract_hash=str(data["contract_hash"]) if data.get("contract_hash") is not None else None,
            indexed=bool(data.get("indexed", True)),
            dependencies=[
                _dependency_edge_from_dict(edge)
                for edge in data.get("dependencies", [])
                if isinstance(edge, dict)
            ],
        )


@dataclass(frozen=True)
class SnapshotPlanEntry:
    """One source-path decision in a refresh plan."""

    path: str
    actions: list[str]
    reason: str
    size: int | None = None
    mtime_ns: int | None = None
    previous_snapshot_id: str | None = None
    content_hash: str | None = None
    invalidated_by: list[str] = field(default_factory=list)

    def to_dict(self) -> dict[str, Any]:
        return {key: value for key, value in asdict(self).items() if value not in (None, [], {})}


@dataclass(frozen=True)
class SnapshotRefreshPlan:
    """A cheap-first plan for refreshing snapshots and derived indexes."""

    root: str
    parser: str
    parser_version: str
    parse_options_hash: str
    contract_hash: str | None
    verify_hashes: bool
    entries: list[SnapshotPlanEntry]

    @property
    def unchanged(self) -> list[str]:
        return _paths_without_actions(self.entries)

    @property
    def needs_hash(self) -> list[str]:
        return _paths_with_action(self.entries, "hash")

    @property
    def needs_parse(self) -> list[str]:
        return _paths_with_action(self.entries, "parse")

    @property
    def needs_index(self) -> list[str]:
        return _paths_with_action(self.entries, "index")

    @property
    def needs_metadata_update(self) -> list[str]:
        return _paths_with_action(self.entries, "metadata")

    @property
    def deleted(self) -> list[str]:
        return _paths_with_action(self.entries, "delete")

    @property
    def invalidated(self) -> list[str]:
        return sorted(entry.path for entry in self.entries if "invalidate" in entry.actions)

    @property
    def dirty(self) -> bool:
        return any(entry.actions for entry in self.entries)

    def to_dict(self) -> dict[str, Any]:
        return {
            "dirty": self.dirty,
            "root": self.root,
            "parser": self.parser,
            "parser_version": self.parser_version,
            "parse_options_hash": self.parse_options_hash,
            "contract_hash": self.contract_hash,
            "verify_hashes": self.verify_hashes,
            "counts": {
                "unchanged": len(self.unchanged),
                "needs_hash": len(self.needs_hash),
                "needs_parse": len(self.needs_parse),
                "needs_index": len(self.needs_index),
                "needs_metadata_update": len(self.needs_metadata_update),
                "deleted": len(self.deleted),
                "invalidated": len(self.invalidated),
            },
            "unchanged": self.unchanged,
            "needs_hash": self.needs_hash,
            "needs_parse": self.needs_parse,
            "needs_index": self.needs_index,
            "needs_metadata_update": self.needs_metadata_update,
            "deleted": self.deleted,
            "invalidated": self.invalidated,
            "entries": [entry.to_dict() for entry in self.entries],
        }


def plan_snapshot_refresh(
    paths: list[str | Path],
    *,
    previous: list[SnapshotState] | dict[str, SnapshotState] | None = None,
    root: str | Path = ".",
    recursive: bool = True,
    parse_options: dict[str, Any] | None = None,
    contract_hash: str | None = None,
    verify_hashes: bool = False,
) -> SnapshotRefreshPlan:
    """Plan snapshot/index refresh work using cheap metadata before hashing.

    When ``verify_hashes`` is false, files with changed size/mtime are marked
    for hash, parse, and index. When true, the planner hashes only those
    metadata-changed files so it can avoid parsing when content is unchanged.
    """
    root_path = Path(root).resolve()
    previous_by_path = _previous_by_path(previous)
    parse_options_hash = _hash_mapping(parse_options or {})
    current_files = {
        _relative(path, root_path): path
        for path in scan_markdown_files(paths, recursive=recursive)
    }
    entries: list[SnapshotPlanEntry] = []
    changed_or_deleted: set[str] = set()

    for relative_path, file_path in sorted(current_files.items()):
        stat = file_path.stat()
        known = previous_by_path.get(relative_path)
        if known is None:
            entries.append(
                SnapshotPlanEntry(
                    path=relative_path,
                    actions=["hash", "parse", "index"],
                    reason="new_file",
                    size=stat.st_size,
                    mtime_ns=stat.st_mtime_ns,
                )
            )
            changed_or_deleted.add(relative_path)
            continue

        identity_changed = (
            known.parser != PARSER_ID
            or known.parser_version != PARSER_VERSION
            or known.parse_options_hash != parse_options_hash
            or known.contract_hash != contract_hash
        )
        if identity_changed:
            entries.append(
                SnapshotPlanEntry(
                    path=relative_path,
                    actions=["hash", "parse", "index"],
                    reason="snapshot_identity_parameters_changed",
                    size=stat.st_size,
                    mtime_ns=stat.st_mtime_ns,
                    previous_snapshot_id=known.snapshot_id,
                )
            )
            changed_or_deleted.add(relative_path)
            continue

        metadata_same = known.size == stat.st_size and known.mtime_ns == stat.st_mtime_ns
        if metadata_same:
            actions = [] if known.indexed else ["index"]
            entries.append(
                SnapshotPlanEntry(
                    path=relative_path,
                    actions=actions,
                    reason="unchanged" if not actions else "snapshot_not_indexed",
                    size=stat.st_size,
                    mtime_ns=stat.st_mtime_ns,
                    previous_snapshot_id=known.snapshot_id,
                    content_hash=known.content_hash,
                )
            )
            continue

        if not verify_hashes:
            entries.append(
                SnapshotPlanEntry(
                    path=relative_path,
                    actions=["hash", "parse", "index"],
                    reason="file_metadata_changed",
                    size=stat.st_size,
                    mtime_ns=stat.st_mtime_ns,
                    previous_snapshot_id=known.snapshot_id,
                )
            )
            changed_or_deleted.add(relative_path)
            continue

        current_hash = _hash_file(file_path)
        if current_hash == known.content_hash:
            actions = ["hash", "metadata"] if known.indexed else ["hash", "metadata", "index"]
            entries.append(
                SnapshotPlanEntry(
                    path=relative_path,
                    actions=actions,
                    reason="file_metadata_changed_content_same",
                    size=stat.st_size,
                    mtime_ns=stat.st_mtime_ns,
                    previous_snapshot_id=known.snapshot_id,
                    content_hash=current_hash,
                )
            )
        else:
            entries.append(
                SnapshotPlanEntry(
                    path=relative_path,
                    actions=["hash", "parse", "index"],
                    reason="content_hash_changed",
                    size=stat.st_size,
                    mtime_ns=stat.st_mtime_ns,
                    previous_snapshot_id=known.snapshot_id,
                    content_hash=current_hash,
                )
            )
            changed_or_deleted.add(relative_path)

    for relative_path, known in sorted(previous_by_path.items()):
        if relative_path in current_files:
            continue
        entries.append(
            SnapshotPlanEntry(
                path=relative_path,
                actions=["delete"],
                reason="source_missing",
                previous_snapshot_id=known.snapshot_id,
                content_hash=known.content_hash,
            )
        )
        changed_or_deleted.add(relative_path)

    invalidated = _transitive_dependents(changed_or_deleted, previous_by_path)
    if invalidated:
        entries = _apply_invalidations(entries, invalidated, changed_or_deleted)

    return SnapshotRefreshPlan(
        root=str(root_path),
        parser=PARSER_ID,
        parser_version=PARSER_VERSION,
        parse_options_hash=parse_options_hash,
        contract_hash=contract_hash,
        verify_hashes=verify_hashes,
        entries=sorted(entries, key=lambda entry: entry.path),
    )


def load_snapshot_state_file(path: str | Path) -> list[SnapshotState]:
    """Load a portable snapshot-state fixture from JSON or YAML."""
    state_path = Path(path)
    data = yaml.safe_load(state_path.read_text(encoding="utf-8")) or {}
    # A top-level list has no `.get`; accept it directly instead of raising AttributeError.
    raw_snapshots = data.get("snapshots", data.get("states", data)) if isinstance(data, dict) else data
    if isinstance(raw_snapshots, dict):
        raw_snapshots = list(raw_snapshots.values())
    if not isinstance(raw_snapshots, list):
        raise ValueError("Snapshot state file must contain a `snapshots` list")
    return [
        SnapshotState.from_dict(item)
        for item in raw_snapshots
        if isinstance(item, dict)
    ]


def _previous_by_path(
    previous: list[SnapshotState] | dict[str, SnapshotState] | None,
) -> dict[str, SnapshotState]:
    if previous is None:
        return {}
    if isinstance(previous, dict):
        return dict(previous)
    return {state.path: state for state in previous}


def _dependency_edge_from_dict(data: dict[str, Any]) -> DependencyEdge:
    return DependencyEdge(
        source_id=str(data["source_id"]),
        target=str(data["target"]),
        kind=str(data["kind"]),
        target_snapshot_id=str(data["target_snapshot_id"]) if data.get("target_snapshot_id") else None,
        metadata=dict(data.get("metadata") or {}),
    )


def _transitive_dependents(
    changed_paths: set[str],
    previous_by_path: dict[str, SnapshotState],
) -> dict[str, list[str]]:
    reverse: dict[str, set[str]] = {}
    for state in previous_by_path.values():
        for edge in state.dependencies:
            reverse.setdefault(edge.target, set()).add(state.path)
            if edge.target_snapshot_id:
                reverse.setdefault(edge.target_snapshot_id, set()).add(state.path)

    invalidates: dict[str, list[str]] = {}
    queue = list(changed_paths)
    visited = set(changed_paths)
    while queue:
        changed = queue.pop(0)
        dependents = sorted(reverse.get(changed, set()))
        if dependents:
            invalidates[changed] = dependents
        for dependent in dependents:
            if dependent in visited:
                continue
            visited.add(dependent)
            queue.append(dependent)
    return invalidates


def _apply_invalidations(
    entries: list[SnapshotPlanEntry],
    invalidates: dict[str, list[str]],
    changed_or_deleted: set[str],
) -> list[SnapshotPlanEntry]:
    dependents_by_path: dict[str, list[str]] = {}
    for changed_path, dependents in invalidates.items():
        for dependent in dependents:
            dependents_by_path.setdefault(dependent, []).append(changed_path)

    existing = {entry.path: entry for entry in entries}
    for dependent, causes in dependents_by_path.items():
        if dependent in changed_or_deleted:
            continue
        entry = existing.get(dependent)
        actions = sorted(set((entry.actions if entry else []) + ["invalidate"]))
        reason = "dependency_changed" if entry is None or entry.reason == "unchanged" else entry.reason
        existing[dependent] = SnapshotPlanEntry(
            path=dependent,
            actions=actions,
            reason=reason,
            size=entry.size if entry else None,
            mtime_ns=entry.mtime_ns if entry else None,
            previous_snapshot_id=entry.previous_snapshot_id if entry else None,
            content_hash=entry.content_hash if entry else None,
            invalidated_by=sorted(set(causes)),
        )
    return list(existing.values())


def _paths_with_action(entries: list[SnapshotPlanEntry], action: str) -> list[str]:
    return sorted(entry.path for entry in entries if action in entry.actions)


def _paths_without_actions(entries: list[SnapshotPlanEntry]) -> list[str]:
    return sorted(entry.path for entry in entries if not entry.actions)


def _relative(path: Path, root: Path) -> str:
    resolved = path.resolve()
    try:
        return resolved.relative_to(root).as_posix()
    except ValueError:
        return resolved.as_posix()


def _hash_file(path: Path) -> str:
    return "sha256:" + hashlib.sha256(path.read_bytes()).hexdigest()


def _hash_mapping(mapping: dict[str, Any]) -> str:
    payload = json.dumps(mapping, sort_keys=True, ensure_ascii=False)
    return "sha256:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()

@@ -19,6 +19,8 @@ from markitect_tool.cache import (
from markitect_tool.backend import (
BackendRegistryError,
load_backend_registry,
load_snapshot_state_file,
plan_snapshot_refresh,
snapshot_identity_for_file,
)
from markitect_tool.content_class import (
@@ -581,6 +583,71 @@ def backend_snapshot_id(
_emit_snapshot_identity(data, output_format)
@backend.command("refresh-plan")
@click.argument("paths", nargs=-1, required=True, type=click.Path(exists=True, path_type=Path))
@click.option(
    "--root",
    type=click.Path(exists=True, file_okay=False, path_type=Path),
    default=Path("."),
    show_default=True,
    help="Root used for relative source paths.",
)
@click.option(
    "--state",
    "state_file",
    type=click.Path(exists=True, dir_okay=False, path_type=Path),
    help="YAML/JSON snapshot state file from a previous backend run.",
)
@click.option("--no-recursive", is_flag=True, help="Do not recurse into directories.")
@click.option(
    "--verify-hashes",
    is_flag=True,
    help="Hash metadata-changed files to avoid unnecessary parse/index work.",
)
@click.option(
    "--parse-option",
    "parse_options",
    multiple=True,
    metavar="KEY=VALUE",
    help="Parse option included in the identity comparison.",
)
@click.option("--contract-hash", help="Optional contract hash included in identity comparison.")
@click.option(
    "--format",
    "output_format",
    type=click.Choice(["json", "yaml", "text"], case_sensitive=False),
    default="text",
    show_default=True,
)
def backend_refresh_plan(
    paths: tuple[Path, ...],
    root: Path,
    state_file: Path | None,
    no_recursive: bool,
    verify_hashes: bool,
    parse_options: tuple[str, ...],
    contract_hash: str | None,
    output_format: str,
) -> None:
    """Plan cheap-first snapshot and index refresh work."""
    try:
        previous = load_snapshot_state_file(state_file) if state_file else []
        plan = plan_snapshot_refresh(
            list(paths),
            previous=previous,
            root=root,
            recursive=not no_recursive,
            parse_options=_parse_key_value_options(parse_options),
            contract_hash=contract_hash,
            verify_hashes=verify_hashes,
        )
    except (ValueError, TypeError) as exc:
        raise click.ClickException(str(exc)) from exc
    _emit_refresh_plan(plan.to_dict(), output_format)
    raise click.exceptions.Exit(1 if plan.dirty else 0)


@main.group("class")
def class_group() -> None:
    """Resolve deterministic content classes."""
@@ -1238,6 +1305,31 @@ def _emit_snapshot_identity(data: dict, output_format: str) -> None:
click.echo(f"parser: {data['parser']} {data['parser_version']}")
def _emit_refresh_plan(data: dict, output_format: str) -> None:
    if output_format == "json":
        click.echo(json.dumps(data, indent=2, ensure_ascii=False))
    elif output_format == "yaml":
        click.echo(yaml.safe_dump(data, sort_keys=False))
    else:
        click.echo("dirty" if data["dirty"] else "clean")
        counts = data["counts"]
        for key in [
            "unchanged",
            "needs_hash",
            "needs_parse",
            "needs_index",
            "needs_metadata_update",
            "deleted",
            "invalidated",
        ]:
            click.echo(f"{key}: {counts[key]}")
        for entry in data["entries"]:
            actions = ",".join(entry.get("actions", [])) or "none"
            click.echo(f"- {entry['path']}: {actions} ({entry['reason']})")
            if entry.get("invalidated_by"):
                click.echo(f" invalidated_by: {', '.join(entry['invalidated_by'])}")


def _emit_content_class_result(data: dict, output_format: str) -> None:
    if output_format == "json":
        click.echo(json.dumps(data, indent=2, ensure_ascii=False))

@@ -0,0 +1,163 @@
import os
from pathlib import Path

from click.testing import CliRunner

from markitect_tool.backend import (
    DependencyEdge,
    SnapshotState,
    load_snapshot_state_file,
    plan_snapshot_refresh,
)
from markitect_tool.cli import main


def test_refresh_plan_marks_all_files_new_without_previous_state(tmp_path: Path):
    source = tmp_path / "doc.md"
    source.write_text("# Doc\n", encoding="utf-8")

    plan = plan_snapshot_refresh([tmp_path], root=tmp_path)

    assert plan.dirty
    assert plan.needs_hash == ["doc.md"]
    assert plan.needs_parse == ["doc.md"]
    assert plan.needs_index == ["doc.md"]


def test_refresh_plan_uses_cheap_metadata_for_unchanged_file(tmp_path: Path):
    source = tmp_path / "doc.md"
    source.write_text("# Doc\n", encoding="utf-8")
    stat = source.stat()
    previous = SnapshotState(
        path="doc.md",
        size=stat.st_size,
        mtime_ns=stat.st_mtime_ns,
        content_hash="sha256:known",
        snapshot_id="snapshot:known",
    )

    plan = plan_snapshot_refresh([tmp_path], previous=[previous], root=tmp_path)

    assert not plan.dirty
    assert plan.unchanged == ["doc.md"]
    assert plan.needs_hash == []


def test_refresh_plan_can_hash_metadata_changed_file_and_skip_parse_if_content_same(tmp_path: Path):
    source = tmp_path / "doc.md"
    source.write_text("# Doc\n", encoding="utf-8")
    stat = source.stat()
    content_hash = _hash_file(source)
    previous = SnapshotState(
        path="doc.md",
        size=stat.st_size,
        mtime_ns=stat.st_mtime_ns,
        content_hash=content_hash,
        snapshot_id="snapshot:known",
    )
    os.utime(source, ns=(stat.st_atime_ns + 1_000_000_000, stat.st_mtime_ns + 1_000_000_000))

    plan = plan_snapshot_refresh(
        [tmp_path],
        previous=[previous],
        root=tmp_path,
        verify_hashes=True,
    )

    assert plan.needs_hash == ["doc.md"]
    assert plan.needs_metadata_update == ["doc.md"]
    assert plan.needs_parse == []
    assert plan.needs_index == []


def test_refresh_plan_invalidates_transitive_dependents(tmp_path: Path):
    source = tmp_path / "source.md"
    dependent = tmp_path / "dependent.md"
    transitive = tmp_path / "transitive.md"
    source.write_text("# Source changed\n", encoding="utf-8")
    dependent.write_text("# Dependent\n", encoding="utf-8")
    transitive.write_text("# Transitive\n", encoding="utf-8")
    source_stat = source.stat()
    dependent_stat = dependent.stat()
    transitive_stat = transitive.stat()
    previous = [
        SnapshotState(
            path="source.md",
            size=1,
            mtime_ns=1,
            content_hash="sha256:old",
            snapshot_id="snapshot:source",
        ),
        SnapshotState(
            path="dependent.md",
            size=dependent_stat.st_size,
            mtime_ns=dependent_stat.st_mtime_ns,
            content_hash=_hash_file(dependent),
            snapshot_id="snapshot:dependent",
            dependencies=[
                DependencyEdge(source_id="snapshot:dependent", target="source.md", kind="reference")
            ],
        ),
        SnapshotState(
            path="transitive.md",
            size=transitive_stat.st_size,
            mtime_ns=transitive_stat.st_mtime_ns,
            content_hash=_hash_file(transitive),
            snapshot_id="snapshot:transitive",
            dependencies=[
                DependencyEdge(source_id="snapshot:transitive", target="dependent.md", kind="reference")
            ],
        ),
    ]

    plan = plan_snapshot_refresh([tmp_path], previous=previous, root=tmp_path)

    assert plan.needs_parse == ["source.md"]
    assert plan.invalidated == ["dependent.md", "transitive.md"]
    entries = {entry.path: entry for entry in plan.entries}
    assert entries["dependent.md"].invalidated_by == ["source.md"]
    assert entries["transitive.md"].invalidated_by == ["dependent.md"]
    assert source_stat.st_size != 1


def test_snapshot_state_file_and_cli_refresh_plan(tmp_path: Path):
    source = tmp_path / "doc.md"
    state_file = tmp_path / "state.yaml"
    source.write_text("# Doc\n", encoding="utf-8")
    stat = source.stat()
    state_file.write_text(
        f"""snapshots:
  - path: doc.md
    size: {stat.st_size}
    mtime_ns: {stat.st_mtime_ns}
    content_hash: {_hash_file(source)}
    snapshot_id: snapshot:known
""",
        encoding="utf-8",
    )

    states = load_snapshot_state_file(state_file)
    result = CliRunner().invoke(
        main,
        [
            "backend",
            "refresh-plan",
            str(tmp_path),
            "--root",
            str(tmp_path),
            "--state",
            str(state_file),
        ],
    )

    assert states[0].path == "doc.md"
    assert result.exit_code == 0
    assert "clean" in result.output
    assert "unchanged: 1" in result.output


def _hash_file(path: Path) -> str:
    import hashlib

    return "sha256:" + hashlib.sha256(path.read_bytes()).hexdigest()

@@ -28,6 +28,25 @@ This backend should later be able to index `MKTT-WP-0010` references, named
regions, chunks, and processor provenance without changing its basic storage
contract.
## Preliminary Refinement - Snapshot Refresh Planning
Implemented before starting the SQLite/index tasks: `SnapshotState`,
`SnapshotPlanEntry`, `SnapshotRefreshPlan`, `plan_snapshot_refresh`,
`load_snapshot_state_file`, and CLI `mkt backend refresh-plan`.
This is the performance contract for WP-0007:
- compare cheap metadata before hashing
- hash only metadata-changed files when `--verify-hashes` is requested
- parse only files whose identity/content requires a new snapshot
- index only new, changed, unindexed, or dependency-invalidated entries
- carry direct and transitive dependency invalidation forward from
`DependencyEdge`
- keep refresh planning inspectable through JSON/YAML/text output
The future SQLite store should persist enough state to feed this planner
directly and should report actual refresh work against the same categories.
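Direct and transitive invalidation amounts to a breadth-first walk over reversed dependency edges. The following is a minimal sketch under simplifying assumptions: dependencies are given as a plain mapping from a dependent path to the paths it references, and `transitive_dependents` is an illustrative name rather than the toolkit's API.

```python
from collections import deque


def transitive_dependents(changed: set[str], depends_on: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each changed or invalidated path to the dependents it invalidates."""
    # Reverse the edges so a target can be looked up to find who depends on it.
    reverse: dict[str, set[str]] = {}
    for dependent, targets in depends_on.items():
        for target in targets:
            reverse.setdefault(target, set()).add(dependent)

    invalidates: dict[str, list[str]] = {}
    visited = set(changed)
    queue = deque(sorted(changed))
    while queue:
        current = queue.popleft()
        dependents = sorted(reverse.get(current, ()))
        if dependents:
            invalidates[current] = dependents
        for dependent in dependents:
            if dependent not in visited:  # the visited set keeps cycles from looping forever
                visited.add(dependent)
                queue.append(dependent)
    return invalidates


# Assumed example graph: transitive.md -> dependent.md -> source.md
edges = {"dependent.md": ["source.md"], "transitive.md": ["dependent.md"]}
result = transitive_dependents({"source.md"}, edges)
```

Here `result` records that `source.md` invalidates `dependent.md`, which in turn invalidates `transitive.md`.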
## P7.1 - Implement local snapshot store
```task
@@ -40,6 +59,14 @@ state_hub_task_id: "8894a9a4-586c-457b-b4e6-add8276ff5f2"
Persist parsed document snapshots and source metadata in a local cache
directory.
Implementation hints:
- Persist `SnapshotState` fields in the snapshot/source tables.
- Store path, size, mtime, content hash, parser id/version, parse options hash,
contract hash, snapshot id, indexed flag, and dependency edges.
- Keep large document/token JSON lazy-loadable so refresh planning does not
pull whole AST payloads into memory.
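One way to read these hints is sketched below with the standard-library `sqlite3` module. Every table name, column name, and helper here (`source_state`, `dependency_edge`, `snapshot_payload`, `open_store`, `save_state`, `load_refresh_metadata`, `load_payload`) is a hypothetical choice for illustration; the work package does not yet define a schema.

```python
import json
import sqlite3

# Hypothetical schema: narrow state rows are kept apart from large JSON payloads.
SCHEMA = """
CREATE TABLE IF NOT EXISTS source_state (
    path               TEXT PRIMARY KEY,
    size               INTEGER NOT NULL,
    mtime_ns           INTEGER NOT NULL,
    content_hash       TEXT NOT NULL,
    snapshot_id        TEXT NOT NULL,
    parser             TEXT NOT NULL,
    parser_version     TEXT NOT NULL,
    parse_options_hash TEXT NOT NULL,
    contract_hash      TEXT,
    indexed            INTEGER NOT NULL DEFAULT 1
);
CREATE TABLE IF NOT EXISTS dependency_edge (
    source_path TEXT NOT NULL,
    target      TEXT NOT NULL,
    kind        TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS snapshot_payload (
    snapshot_id   TEXT PRIMARY KEY,
    document_json TEXT NOT NULL
);
"""

STATE_COLUMNS = (
    "path", "size", "mtime_ns", "content_hash", "snapshot_id", "parser",
    "parser_version", "parse_options_hash", "contract_hash", "indexed",
)


def open_store(database: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(database)
    conn.executescript(SCHEMA)
    return conn


def save_state(conn: sqlite3.Connection, state: dict, document: dict) -> None:
    """Persist one source state, its dependency edges, and its parsed payload."""
    row = {"contract_hash": None, "indexed": 1, **state}
    placeholders = ", ".join("?" for _ in STATE_COLUMNS)
    with conn:  # one transaction per source
        conn.execute(
            f"INSERT OR REPLACE INTO source_state ({', '.join(STATE_COLUMNS)}) VALUES ({placeholders})",
            [row[column] for column in STATE_COLUMNS],
        )
        conn.execute("DELETE FROM dependency_edge WHERE source_path = ?", (row["path"],))
        conn.executemany(
            "INSERT INTO dependency_edge (source_path, target, kind) VALUES (?, ?, ?)",
            [(row["path"], edge["target"], edge["kind"]) for edge in state.get("dependencies", [])],
        )
        conn.execute(
            "INSERT OR REPLACE INTO snapshot_payload (snapshot_id, document_json) VALUES (?, ?)",
            (row["snapshot_id"], json.dumps(document)),
        )


def load_refresh_metadata(conn: sqlite3.Connection) -> list[dict]:
    """Read only the narrow columns; payload JSON is never touched here."""
    cursor = conn.execute(f"SELECT {', '.join(STATE_COLUMNS)} FROM source_state ORDER BY path")
    return [dict(zip(STATE_COLUMNS, values)) for values in cursor]


def load_payload(conn: sqlite3.Connection, snapshot_id: str):
    """Fetch the large document JSON on demand, or None if it is unknown."""
    found = conn.execute(
        "SELECT document_json FROM snapshot_payload WHERE snapshot_id = ?", (snapshot_id,)
    ).fetchone()
    return json.loads(found[0]) if found else None
```

Keeping the payload in its own table means a planner-style read touches only small rows, which is the point of the lazy-loading hint.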
## P7.2 - Add AST introspection commands
```task
@@ -86,6 +113,14 @@ and metrics in SQLite.
Keep schema extension points for reference edges, named regions, chunks, and
processor outputs.
Implementation hints:
- Use narrow metadata tables for hot refresh decisions.
- Store document/token JSON separately from searchable section/block rows.
- Add indexes on path, content hash, snapshot id, parser version, and unit ids.
- Preserve source spans and content-unit ids from WP-0010 reference/literate
layers.
## P7.5 - Add FTS5 section/block search
```task
@@ -111,6 +146,16 @@ Refresh only changed files based on content hash and parser version.
Include dependency invalidation hooks for future transclusion/reference graphs.
Implementation hints:
- Drive incremental refresh from `SnapshotRefreshPlan`.
- The first pass should use cheap metadata; only hash metadata-changed files.
- With `--verify-hashes`, skip parse/index when content is unchanged and only
update metadata.
- Use reverse dependency edges for direct and transitive invalidation.
- Report planned vs actual counts for hash, parse, index, metadata update,
delete, and invalidation work.
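The planned-versus-actual report in the last hint could be as simple as counting actions per category on both sides. This sketch assumes entries shaped like the `entries` list in the planner's dictionary output, where `actions` is absent for unchanged files; the shape of the "actual" entries and the names `count_actions` and `compare_counts` are assumptions.

```python
from collections import Counter

# Action names as they appear in refresh-plan entries.
CATEGORIES = ("hash", "parse", "index", "metadata", "delete", "invalidate")


def count_actions(entries: list[dict]) -> dict[str, int]:
    """Count how many entries carry each action; entries without actions count nowhere."""
    counts = Counter(action for entry in entries for action in entry.get("actions", []))
    return {category: counts.get(category, 0) for category in CATEGORIES}


def compare_counts(planned: list[dict], actual: list[dict]) -> dict[str, dict[str, int]]:
    """Report planned and actual counts per category, plus their difference."""
    planned_counts = count_actions(planned)
    actual_counts = count_actions(actual)
    return {
        category: {
            "planned": planned_counts[category],
            "actual": actual_counts[category],
            "delta": actual_counts[category] - planned_counts[category],
        }
        for category in CATEGORIES
    }


# Assumed sample data: the plan expected a re-parse, but hashing showed the content was unchanged.
planned = [
    {"path": "a.md", "actions": ["hash", "parse", "index"]},
    {"path": "b.md"},
    {"path": "c.md", "actions": ["invalidate"]},
]
actual = [
    {"path": "a.md", "actions": ["hash", "metadata"]},
    {"path": "c.md", "actions": ["invalidate"]},
]
report = compare_counts(planned, actual)
```

A non-zero `delta` in any category flags where the executed refresh diverged from the plan.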
## P7.7 - Add local index CLI
```task