diff --git a/docs/markitect-main-scope-assessment.md b/docs/markitect-main-scope-assessment.md
index a830f48..8cd801d 100644
--- a/docs/markitect-main-scope-assessment.md
+++ b/docs/markitect-main-scope-assessment.md
@@ -47,6 +47,17 @@ consumer needs them through the new library contract:
 4. Treat old code as reference material; do not preserve backward compatibility unless the new contract explicitly needs it.
 5. Keep database, platform, and domain lifecycle concerns out of this repo.
+## Practicality Reassessment
+
+The first implementation slices intentionally rebuilt the clean parser and JSON
+Schema spine. That is necessary but not sufficient. The legacy project already
+showed that heading counts and raw structural schemas have limited practical
+utility.
+
+The successor should prioritize a document contract framework before going much
+deeper into generic tooling. See `docs/practical-schema-framework-research.md`
+and `workplans/MKTT-WP-0004-practical-contract-framework.md`.
+
 ## Initial Architecture Target
 ```text
diff --git a/docs/practical-schema-framework-research.md b/docs/practical-schema-framework-research.md
new file mode 100644
index 0000000..56a36a1
--- /dev/null
+++ b/docs/practical-schema-framework-research.md
@@ -0,0 +1,323 @@
+# Practical Schema Framework Research
+
+Date: 2026-05-03
+
+## Purpose
+
+This document reassesses `markitect-tool` schema utility before further
+implementation. The concern is that pure structural validation, such as heading
+counts and min/max depth constraints, is rarely enough to make markdown document
+pipelines useful.
+
+The practical opportunity is to define a stronger framework for markdown-native
+document contracts: section specifications, content assertions, form fields,
+context-aware rules, LLM-assisted assessments, and high-quality diagnostics.
+
+## Research Signals
+
+### Structured Authoring
+
+DITA is the strongest analogue for typed, reusable textual units. It emphasizes
+information typing, semantic markup, modularity, reuse, interchange, and
+multiple deliverables from one source. A DITA topic is the unit of authoring and
+reuse; topics may be generic or specialized into roles such as concept, task, or
+reference.
+
+Relevance for `markitect-tool`:
+
+- A markdown document or section should have an explicit information type.
+- Information type should imply expected structure and reader purpose.
+- Reuse and composition need stable addressing of sections, not only files.
+- Specialization is a better mental model than ad hoc schema forks.
+
+Sources:
+
+- https://dita-lang.org/dita/archspec/base/basic-concepts
+- https://dita-lang.org/dita/archspec/base/introduction-to-dita
+
+### Document Schemas With Assertions
+
+DocBook remains relevant because it combines formal document schemas with
+Schematron-style assertions. That is the missing layer in many simplistic JSON
+Schema approaches: grammar says what may exist; assertions say what must be true
+in context.
+
+Relevance for `markitect-tool`:
+
+- JSON Schema over `Document.to_dict()` is useful but insufficient.
+- We need a second assertion layer for document-specific semantics.
+- Diagnostics must point to the document location and rule intention.
+
+Source:
+
+- https://docbook.org/schemas/docbook/
+
+### Dynamic Form Rules
+
+JSON Schema supports conditional validation through `dependentRequired`,
+`dependentSchemas`, and `if`/`then`/`else`. JSON Forms separates data schema
+from UI schema and uses rules to show, hide, enable, or disable UI elements
+based on JSON Schema conditions. Form.io’s architecture treats the form schema
+as a single source of truth for validation and conditional logic across client
+and server.
+
+Relevance for `markitect-tool`:
+
+- Forms should be first-class, not bolted onto document generation.
+- Field definitions need static validation and dynamic rules.
+- Prefill, visibility, requiredness, and calculated values should come from the
+  same contract used for generation and validation.
+- Context data must be explicit and typed.
+
+Sources:
+
+- https://json-schema.org/understanding-json-schema/reference/conditionals
+- https://jsonforms.io/docs/uischema/rules/
+- https://form.io/features/form-conditional-logic-form-validation/
+
+### LLM-Assisted Assessment
+
+Modern evaluation frameworks treat LLM assessment as explicit graders or
+rubrics. OpenAI graders return scores in a 0–1 range and can combine grader
+types. Promptfoo’s `llm-rubric` uses explicit criteria and expects structured
+judge output with reason, score, and pass/fail.
+
+Relevance for `markitect-tool`:
+
+- LLM checks should be declared as assessment rules, not hidden in prompts.
+- Deterministic validation and LLM assessment should produce one diagnostic
+  model.
+- Section-level rubrics are more useful than whole-document vague grading.
+- The LLM provider must remain external; `markitect-tool` defines contracts and
+  reports.
+
+Sources:
+
+- https://developers.openai.com/api/docs/guides/graders
+- https://www.promptfoo.dev/docs/configuration/expected-outputs/model-graded/llm-rubric/
+
+### Markdown Structure
+
+CommonMark gives markdown a well-defined block/inline model. mdast gives a
+language-neutral tree vocabulary for Markdown nodes. Both point toward keeping
+the parse layer separate from domain/schema layers.
+
+Relevance for `markitect-tool`:
+
+- The core document model should stay close to CommonMark/mdast concepts.
+- Practical document contracts should sit above the parse model.
+- Section addressing, source spans, and block identity are foundational for good
+  diagnostics.
+
+Sources:
+
+- https://spec.commonmark.org/0.31.2/
+- https://github.com/syntax-tree/mdast
+
+## Terminology Proposal
+
+| Term | Meaning |
+| --- | --- |
+| Document | A markdown artifact parsed into frontmatter, blocks, headings, sections, and source spans. |
+| Section | A heading-led document region with content, children, source location, and stable identity. |
+| Document Type | A named contract for a whole document, e.g. ADR, PRD, invoice letter, support reply, concept note. |
+| Section Type | A reusable role for a section, e.g. Context, Decision, Risks, Procedure, Evidence, Conclusion. |
+| Field | A typed value expected in frontmatter, inline matter, a section, or an external data record. |
+| Form | A field collection with UI hints, validation rules, defaults, dynamic visibility, and calculations. |
+| Context | External data available during validation/generation, such as user data, project data, dates, or related entities. |
+| Rule | A deterministic condition evaluated against document, fields, context, or pipeline state. |
+| Assertion | A claim that must hold for content, usually richer than shape validation. |
+| Metric Band | A soft or hard target for size/complexity, such as word count, sentence count, section count, or reading level. |
+| Assessment | A deterministic or LLM-assisted evaluation that returns pass/fail, score, reason, and diagnostics. |
+| Rubric | A human-readable criterion for LLM-assisted assessment, scoped to a document or section type. |
+| Diagnostic | A structured finding with severity, code, message, source location, rule id, and suggested repair. |
+| Contract | The full specification for a document type: structure, sections, fields, rules, forms, assertions, rubrics, and outputs. |
+| Pipeline | A repeatable sequence of parse, prefill, generate, validate, assess, transform, and compose operations. |
+
+## Most Relevant Use Cases
+
+### UC-001: Typed Document Contract
+
+Define a document type such as ADR, PRD, FRS, workplan, customer letter, or
+meeting brief. Specify required sections by semantic role, allowed alternatives,
+field requirements, and diagnostics.
+
+Practical value:
+
+- Prevents missing critical content.
+- Makes generated documents predictable.
+- Creates an explicit contract for humans and agents.
+
+Needed tooling:
+
+- `mkt contract check --contract `
+- Section matching by heading text, aliases, ids, or section type markers.
+- Diagnostics that say which section/field/assertion failed and why.
+
+### UC-002: Section-Level Content Expectations
+
+Specify what a section is expected to contain: assertions, required evidence,
+forbidden omissions, content patterns, examples, and reviewer prompts.
+
+Practical value:
+
+- Moves beyond “has a heading” toward “does the section do its job?”
+- Enables review of generated or human-authored text.
+
+Needed tooling:
+
+- Deterministic assertions for regex, presence, references, counts, and field
+  values.
+- Optional LLM rubrics for semantic content checks.
+- Per-section diagnostic reports.
+
+### UC-003: Size and Complexity Bands
+
+Define soft/hard bands for document and section size: words, characters,
+sentences, paragraphs, sections, list items, code blocks, and nesting depth.
+
+Practical value:
+
+- Controls generation output size.
+- Keeps templates from becoming bloated or underdeveloped.
+- Helps compare intended vs actual document complexity.
+
+Needed tooling:
+
+- Metrics extractor.
+- Rule severities: info, warning, error.
+- “Too small/too large” diagnostics with actual and target values.
+
+### UC-004: Form-Backed Markdown Generation
+
+Define forms that collect or prefill structured fields, then render markdown
+documents. Fields may be static, calculated, conditional, or context-derived.
+
+Practical value:
+
+- Bridges structured data capture and prose generation.
+- Supports repeatable business documents.
+- Makes prefill from user/project/entity data explicit.
+
+Needed tooling:
+
+- Field schema.
+- UI schema or form hints.
+- Dynamic rules for requiredness, visibility, defaults, and calculations.
+- Template rendering with validation before and after render.
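The UC-004 tooling list can be illustrated with a minimal, standard-library sketch. It is not part of this diff: `Field`, `resolve`, `FIELDS`, and `TEMPLATE` are hypothetical names, and a real implementation would more likely express the rules in the contract format rather than as Python lambdas. The sketch only shows one field list driving defaults, calculated values, dynamic requiredness, and rendering.

```python
from __future__ import annotations

from dataclasses import dataclass
from string import Template
from typing import Any, Callable


@dataclass(frozen=True)
class Field:
    """One form field. Hypothetical shape, not markitect-tool API."""

    name: str
    default: Any = None
    # Dynamic requiredness: decided from the other values at resolve time.
    required_if: Callable[[dict[str, Any]], bool] = lambda values: False
    # Calculated value: derived from other values when none was supplied.
    calculate: Callable[[dict[str, Any]], Any] | None = None


def resolve(fields: list[Field], supplied: dict[str, Any]) -> tuple[dict[str, Any], list[str]]:
    """Apply defaults, report missing required fields, then run calculations."""
    values = {f.name: supplied.get(f.name, f.default) for f in fields}
    missing = [
        f.name
        for f in fields
        if values[f.name] is None and f.calculate is None and f.required_if(values)
    ]
    if not missing:
        # Calculations only run once every required input is present.
        for f in fields:
            if values[f.name] is None and f.calculate is not None:
                values[f.name] = f.calculate(values)
    return values, missing


FIELDS = [
    Field("customer", required_if=lambda v: True),
    Field("net", required_if=lambda v: True),
    Field("vat_rate", default=0.19),
    Field("gross", calculate=lambda v: round(v["net"] * (1 + v["vat_rate"]), 2)),
    Field("kind", default="invoice"),
    # Conditionally required: only reminders need a due date.
    Field("due_date", required_if=lambda v: v["kind"] == "reminder"),
]

TEMPLATE = Template("# Invoice for $customer\n\nNet: $net\nGross: $gross\n")

values, missing = resolve(FIELDS, {"customer": "ACME", "net": 100.0})
# Validation happens before rendering: render only when nothing is missing.
markdown = TEMPLATE.substitute(values) if not missing else ""
print(markdown)
```

Supplying `{"kind": "reminder"}` without a `due_date` leaves `missing` non-empty, so nothing is rendered. Validation after render would then run on the generated markdown.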
+
+### UC-005: Context-Aware Validation
+
+Validate a document against external context: user data, project metadata,
+related entities, dates, policy constraints, or canonical terminology.
+
+Practical value:
+
+- Checks whether a document is correct for this case, not only generally
+  well-formed.
+- Enables pipelines like personalized letters, compliance reports, and
+  project-specific workplans.
+
+Needed tooling:
+
+- Context object schema.
+- Resolvers for local files, JSON/YAML data, and later higher-layer systems.
+- Rule expressions that can reference document and context paths.
+
+### UC-006: LLM-Assisted Section Assessment
+
+Attach rubrics to section types. Use an external LLM adapter to assess whether a
+section satisfies the rubric, returning score, reason, and pass/fail.
+
+Practical value:
+
+- Handles semantic checks that deterministic rules cannot.
+- Supports review loops for generated text.
+- Makes subjective requirements explicit and auditable.
+
+Needed tooling:
+
+- Rubric declaration format.
+- Provider-neutral assessment request/response models.
+- Caching and reproducibility metadata.
+- Clear distinction between deterministic errors and model-judged findings.
+
+### UC-007: Pipeline Diagnostics and Repair Guidance
+
+Run a document pipeline and get one coherent diagnostic report from parsing,
+schema checks, field validation, assertions, generation, composition, and
+LLM-assisted assessments.
+
+Practical value:
+
+- Makes failures debuggable.
+- Helps humans and agents repair documents.
+- Avoids scattered errors from unrelated subsystems.
+
+Needed tooling:
+
+- Common diagnostic model.
+- Error codes and severities.
+- Source spans and rule ids.
+- Suggested repair text or structured patches when safe.
+
+## Comparison With markitect-main
+
+`markitect-main` had several useful seeds:
+
+- `x-markitect-sections` for required/recommended/optional/discouraged/improper sections.
+- `x-markitect-content-control` for required, discouraged, and forbidden patterns plus word-count metrics.
+- Section and content validators with warnings/errors.
+- Schema generation and validation experiments.
+- Draft generation with `x-markitect-field-mapping`.
+- Prompt quality gates with schema and pattern validators.
+- Infospace entity parsing and LLM classification/evaluation.
+
+The problem was not lack of ideas. The problem was that the ideas lived in
+separate subsystems with different models:
+
+- Schema validation compared generated schemas rather than validating a stable
+  document contract.
+- Semantic validation used `x-markitect-*` extensions but was not integrated
+  into a unified contract framework.
+- Field mapping existed in draft generation, not in a general form/context
+  model.
+- LLM quality gates existed inside prompt execution, not as provider-neutral
+  document assessments.
+- Infospace checks were domain/application layer behavior, not syntax-layer
+  primitives.
+
+## Strategic Direction
+
+The successor should introduce a framework layer above parsing:
+
+```text
+Markdown parse model
+  -> document contract
+  -> section specifications
+  -> field/form specifications
+  -> deterministic rules/assertions
+  -> metric bands
+  -> optional LLM rubrics
+  -> unified diagnostics
+```
+
+This should not replace JSON Schema. JSON Schema remains useful for typed data
+and machine validation. The new layer should make document-specific semantics
+natural.
+
+## Recommendation
+
+Do not continue straight into generic query/transform work until this framework
+direction is captured. The next implementation slice should be a small,
+deterministic version of document contracts:
+
+1. Define the contract schema and terminology.
+2. Implement section specifications.
+3. Implement metric bands.
+4. Implement the unified diagnostic model.
+5. Leave LLM rubrics and form dynamics as designed extension points for the next
+   slice.
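As a rough illustration of steps 2 to 4, the sketch below checks section specifications and one word-count band, and reports both through a single diagnostic type. Everything in it is an assumption made for illustration: `Diagnostic`, `SectionSpec`, `split_sections`, `check`, and the diagnostic codes are hypothetical, and the heading splitter is a naive regex that ignores fenced code, where the real tool would read sections from the parse model.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Diagnostic:
    """One finding; the same shape serves section and metric checks."""

    severity: str  # "info", "warning", or "error"
    code: str
    message: str
    rule_id: str
    line: int | None = None


@dataclass(frozen=True)
class SectionSpec:
    section_type: str
    aliases: tuple[str, ...]  # heading texts that satisfy this spec
    required: bool = True
    min_words: int | None = None  # lower bound of a soft metric band


def split_sections(markdown: str) -> dict[str, tuple[int, str]]:
    """Map lower-cased heading text to (heading line number, body text)."""
    sections: dict[str, tuple[int, str]] = {}
    current: str | None = None
    for number, line in enumerate(markdown.splitlines(), start=1):
        match = re.match(r"^#{1,6}\s+(.*\S)\s*$", line)
        if match:
            current = match.group(1).lower()
            sections[current] = (number, "")
        elif current is not None:
            start, body = sections[current]
            sections[current] = (start, body + line + "\n")
    return sections


def check(markdown: str, specs: list[SectionSpec]) -> list[Diagnostic]:
    """Evaluate section presence and word-count bands for each spec."""
    sections = split_sections(markdown)
    found: list[Diagnostic] = []
    for spec in specs:
        hit = next(
            (sections[a.lower()] for a in spec.aliases if a.lower() in sections),
            None,
        )
        if hit is None:
            if spec.required:
                found.append(Diagnostic(
                    "error", "SECTION_MISSING",
                    f"missing required section of type {spec.section_type!r}",
                    f"section.{spec.section_type}",
                ))
            continue
        line, body = hit
        words = len(body.split())
        if spec.min_words is not None and words < spec.min_words:
            found.append(Diagnostic(
                "warning", "SECTION_TOO_SMALL",
                f"{words} words, expected at least {spec.min_words}",
                f"metric.{spec.section_type}.words", line,
            ))
    return found


ADR = [
    SectionSpec("context", ("Context", "Background"), min_words=5),
    SectionSpec("decision", ("Decision",)),
]
doc = "# ADR 7\n\n## Background\n\nToo short.\n"
diagnostics = check(doc, ADR)
for d in diagnostics:
    print(d.severity, d.code, d.rule_id, d.line, d.message)
```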
+
+This is the utility inflection point. It will make `markitect-tool` practically
+useful instead of merely structurally correct.
diff --git a/docs/state-hub-integration.md b/docs/state-hub-integration.md
index c6bc8ce..12f2a6f 100644
--- a/docs/state-hub-integration.md
+++ b/docs/state-hub-integration.md
@@ -34,7 +34,7 @@ workplans/
 SBOM source: `sbom-tools.yaml`.
-Initial SBOM ingest succeeded on 2026-05-03 with seven declared entries for the
+Initial SBOM ingest succeeded on 2026-05-03 with eight declared entries for the
 core and optional dependencies.
 ## Registered Extension Points
diff --git a/pyproject.toml b/pyproject.toml
index 52ef696..cb5e0c5 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -11,6 +11,7 @@ requires-python = ">=3.12"
 license = { text = "MIT" }
 dependencies = [
     "click>=8.0",
+    "jsonschema>=4.0",
     "markdown-it-py",
     "PyYAML",
 ]
diff --git a/sbom-tools.yaml b/sbom-tools.yaml
index a39110e..195ce05 100644
--- a/sbom-tools.yaml
+++ b/sbom-tools.yaml
@@ -7,6 +7,10 @@ tools:
     ecosystem: python
     is_direct: true
     is_dev: false
+  - name: jsonschema
+    ecosystem: python
+    is_direct: true
+    is_dev: false
   - name: PyYAML
     ecosystem: python
     is_direct: true
diff --git a/src/markitect_tool/__init__.py b/src/markitect_tool/__init__.py
index 3b18cf9..488bf00 100644
--- a/src/markitect_tool/__init__.py
+++ b/src/markitect_tool/__init__.py
@@ -9,6 +9,14 @@ from markitect_tool.core import (
     parse_markdown,
     parse_markdown_file,
 )
+from markitect_tool.schema import (
+    MarkdownSchema,
+    SchemaValidationResult,
+    ValidationViolation,
+    load_schema_file,
+    validate_document,
+    validate_markdown_file,
+)
 __all__ = [
     "ContentBlock",
@@ -18,4 +26,10 @@ __all__ = [
     "Section",
     "parse_markdown",
     "parse_markdown_file",
+    "MarkdownSchema",
+    "SchemaValidationResult",
+    "ValidationViolation",
+    "load_schema_file",
+    "validate_document",
+    "validate_markdown_file",
 ]
diff --git a/src/markitect_tool/cli/main.py b/src/markitect_tool/cli/main.py
index 3b52d7a..a0ff5bf 100644
--- a/src/markitect_tool/cli/main.py
+++ b/src/markitect_tool/cli/main.py
@@ -9,6 +9,7 @@ import click
 import yaml
 from markitect_tool.core import parse_markdown_file
+from markitect_tool.schema import load_schema_file, validate_markdown_file, validate_schema
 @click.group()
@@ -40,5 +41,66 @@ def parse(file: Path, output_format: str) -> None:
         click.echo(json.dumps(data, indent=2, ensure_ascii=False))
+@main.command()
+@click.argument("file", type=click.Path(exists=True, dir_okay=False, path_type=Path))
+@click.option(
+    "--schema",
+    "schema_file",
+    required=True,
+    type=click.Path(exists=True, dir_okay=False, path_type=Path),
+)
+@click.option(
+    "--format",
+    "output_format",
+    type=click.Choice(["json", "yaml", "text"], case_sensitive=False),
+    default="text",
+    show_default=True,
+)
+def validate(file: Path, schema_file: Path, output_format: str) -> None:
+    """Validate a Markdown file against a Markdown schema file."""
+
+    result = validate_markdown_file(file, schema_file)
+    _emit_result(result.to_dict(), output_format)
+    raise click.exceptions.Exit(0 if result.valid else 1)
+
+
+@main.group()
+def schema() -> None:
+    """Work with Markdown schema files."""
+
+
+@schema.command("validate")
+@click.argument("schema_file", type=click.Path(exists=True, dir_okay=False, path_type=Path))
+@click.option(
+    "--format",
+    "output_format",
+    type=click.Choice(["json", "yaml", "text"], case_sensitive=False),
+    default="text",
+    show_default=True,
+)
+def schema_validate(schema_file: Path, output_format: str) -> None:
+    """Validate that a Markdown schema contains a well-formed JSON Schema."""
+
+    loaded = load_schema_file(schema_file)
+    result = validate_schema(loaded.schema)
+    data = result.to_dict() | {"schema_path": str(schema_file)}
+    _emit_result(data, output_format)
+    raise click.exceptions.Exit(0 if result.valid else 1)
+
+
+def _emit_result(data: dict, output_format: str) -> None:
+    if output_format == "json":
+        click.echo(json.dumps(data, indent=2, ensure_ascii=False))
+    elif output_format == "yaml":
+        click.echo(yaml.safe_dump(data, sort_keys=False))
+    else:
+        if data.get("valid"):
+            click.echo("valid")
+        else:
+            click.echo("invalid")
+            for violation in data.get("violations", []):
+                click.echo(f"- {violation['path']}: {violation['message']}")
+
+
 if __name__ == "__main__":
     main()
diff --git a/src/markitect_tool/schema/__init__.py b/src/markitect_tool/schema/__init__.py
new file mode 100644
index 0000000..6c8ff5e
--- /dev/null
+++ b/src/markitect_tool/schema/__init__.py
@@ -0,0 +1,31 @@
+"""Schema loading and validation for structured Markdown documents."""
+
+from markitect_tool.schema.loader import (
+    InvalidSchemaFormatError,
+    MarkdownSchema,
+    SchemaLoaderError,
+    SchemaNotFoundError,
+    load_schema_file,
+    load_schema_text,
+)
+from markitect_tool.schema.validator import (
+    SchemaValidationResult,
+    ValidationViolation,
+    validate_document,
+    validate_markdown_file,
+    validate_schema,
+)
+
+__all__ = [
+    "InvalidSchemaFormatError",
+    "MarkdownSchema",
+    "SchemaLoaderError",
+    "SchemaNotFoundError",
+    "SchemaValidationResult",
+    "ValidationViolation",
+    "load_schema_file",
+    "load_schema_text",
+    "validate_document",
+    "validate_markdown_file",
+    "validate_schema",
+]
diff --git a/src/markitect_tool/schema/loader.py b/src/markitect_tool/schema/loader.py
new file mode 100644
index 0000000..55e7d7e
--- /dev/null
+++ b/src/markitect_tool/schema/loader.py
@@ -0,0 +1,124 @@
+"""Load JSON Schema definitions embedded in Markdown schema files."""
+
+from __future__ import annotations
+
+import json
+import re
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Any
+
+import yaml
+
+
+class SchemaLoaderError(ValueError):
+    """Base error raised for schema loading failures."""
+
+
+class SchemaNotFoundError(SchemaLoaderError):
+    """Raised when no JSON schema block can be found."""
+
+
+class InvalidSchemaFormatError(SchemaLoaderError):
+    """Raised when a schema block exists but is not valid JSON object data."""
+
+
+@dataclass(frozen=True)
+class MarkdownSchema:
+    """A JSON Schema loaded from a Markdown schema document."""
+
+    schema: dict[str, Any]
+    metadata: dict[str, Any]
+    documentation: str
+    source_path: str | None = None
+
+    def to_dict(self) -> dict[str, Any]:
+        data = {
+            "schema": self.schema,
+            "metadata": self.metadata,
+            "documentation": self.documentation,
+            "source_path": self.source_path,
+        }
+        return {key: value for key, value in data.items() if value is not None}
+
+
+_JSON_BLOCK_RE = re.compile(r"```json\s*(.*?)```", re.DOTALL | re.IGNORECASE)
+
+
+def load_schema_file(path: str | Path) -> MarkdownSchema:
+    """Load a Markdown schema file."""
+
+    schema_path = Path(path)
+    if not schema_path.exists():
+        raise FileNotFoundError(f"Schema file not found: {schema_path}")
+    return load_schema_text(schema_path.read_text(encoding="utf-8"), source_path=str(schema_path))
+
+
+def load_schema_text(text: str, source_path: str | None = None) -> MarkdownSchema:
+    """Load a Markdown schema document from text."""
+
+    metadata, documentation = _split_frontmatter(text)
+    schema = _extract_json_schema(documentation)
+    schema = dict(schema)
+    schema.setdefault(
+        "x-markitect-source",
+        {
+            "format": "markdown",
+            "file": source_path,
+            "frontmatter": metadata,
+        },
+    )
+    return MarkdownSchema(
+        schema=schema,
+        metadata=metadata,
+        documentation=documentation,
+        source_path=source_path,
+    )
+
+
+def _split_frontmatter(text: str) -> tuple[dict[str, Any], str]:
+    if not text.startswith("---\n"):
+        return {}, text
+
+    end = text.find("\n---", 4)
+    if end == -1:
+        return {}, text
+
+    closing_end = text.find("\n", end + 4)
+    if closing_end == -1:
+        closing_end = len(text)
+    else:
+        closing_end += 1
+
+    raw = text[4:end]
+    try:
+        metadata = yaml.safe_load(raw) if raw.strip() else {}
+    except yaml.YAMLError as exc:
+        raise InvalidSchemaFormatError(f"Invalid schema frontmatter: {exc}") from exc
+    if metadata is None:
+        metadata = {}
+    if not isinstance(metadata, dict):
+        raise InvalidSchemaFormatError("Schema frontmatter must be a mapping")
+    return metadata, text[closing_end:]
+
+
+def _extract_json_schema(text: str) -> dict[str, Any]:
+    candidates = list(_JSON_BLOCK_RE.finditer(text))
+    if not candidates:
+        raise SchemaNotFoundError("No JSON schema found in markdown schema")
+
+    parsed_blocks: list[dict[str, Any]] = []
+    for match in candidates:
+        raw_json = match.group(1).strip()
+        try:
+            data = json.loads(raw_json)
+        except json.JSONDecodeError as exc:
+            raise InvalidSchemaFormatError(f"Invalid JSON schema block: {exc}") from exc
+        if not isinstance(data, dict):
+            raise InvalidSchemaFormatError("JSON schema block must contain an object")
+        parsed_blocks.append(data)
+
+    for data in parsed_blocks:
+        if "$schema" in data or "type" in data:
+            return data
+    return parsed_blocks[0]
diff --git a/src/markitect_tool/schema/validator.py b/src/markitect_tool/schema/validator.py
new file mode 100644
index 0000000..1c9e74b
--- /dev/null
+++ b/src/markitect_tool/schema/validator.py
@@ -0,0 +1,110 @@
+"""Validate parsed Markdown documents against JSON Schema."""
+
+from __future__ import annotations
+
+from dataclasses import asdict, dataclass
+from pathlib import Path
+from typing import Any
+
+from jsonschema import Draft202012Validator, SchemaError, ValidationError
+
+from markitect_tool.core import Document, parse_markdown_file
+from markitect_tool.schema.loader import MarkdownSchema, load_schema_file
+
+
+@dataclass(frozen=True)
+class ValidationViolation:
+    """A single schema validation violation."""
+
+    path: str
+    message: str
+    schema_path: str
+
+    def to_dict(self) -> dict[str, str]:
+        return asdict(self)
+
+
+@dataclass(frozen=True)
+class SchemaValidationResult:
+    """Validation result for one document and one schema."""
+
+    valid: bool
+    violations: list[ValidationViolation]
+    document_path: str | None = None
+    schema_path: str | None = None
+
+    def to_dict(self) -> dict[str, Any]:
+        data = {
+            "valid": self.valid,
+            "violations": [violation.to_dict() for violation in self.violations],
+            "document_path": self.document_path,
+            "schema_path": self.schema_path,
+        }
+        return {key: value for key, value in data.items() if value is not None}
+
+
+def validate_schema(schema: dict[str, Any]) -> SchemaValidationResult:
+    """Validate that a JSON Schema itself is well formed."""
+
+    try:
+        Draft202012Validator.check_schema(schema)
+    except SchemaError as exc:
+        return SchemaValidationResult(
+            valid=False,
+            violations=[
+                ValidationViolation(
+                    path=_format_path(exc.path),
+                    message=exc.message,
+                    schema_path=_format_path(exc.schema_path),
+                )
+            ],
+        )
+    return SchemaValidationResult(valid=True, violations=[])
+
+
+def validate_markdown_file(
+    markdown_path: str | Path, schema_path: str | Path
+) -> SchemaValidationResult:
+    """Parse and validate a Markdown file against a Markdown schema file."""
+
+    document = parse_markdown_file(markdown_path)
+    loaded_schema = load_schema_file(schema_path)
+    return validate_document(document, loaded_schema)
+
+
+def validate_document(
+    document: Document, schema: MarkdownSchema | dict[str, Any]
+) -> SchemaValidationResult:
+    """Validate a parsed document against a loaded or raw JSON Schema."""
+
+    raw_schema = schema.schema if isinstance(schema, MarkdownSchema) else schema
+    schema_path = schema.source_path if isinstance(schema, MarkdownSchema) else None
+    schema_check = validate_schema(raw_schema)
+    if not schema_check.valid:
+        return SchemaValidationResult(
+            valid=False,
+            violations=schema_check.violations,
+            document_path=document.source_path,
+            schema_path=schema_path,
+        )
+
+    validator = Draft202012Validator(raw_schema)
+    violations = [
+        ValidationViolation(
+            path=_format_path(error.path),
+            message=error.message,
+            schema_path=_format_path(error.schema_path),
+        )
+        for error in sorted(validator.iter_errors(document.to_dict()), key=str)
+    ]
+    return SchemaValidationResult(
+        valid=not violations,
+        violations=violations,
+        document_path=document.source_path,
+        schema_path=schema_path,
+    )
+
+
+def _format_path(path: Any) -> str:
+    parts = [str(part) for part in path]
+    return "$" if not parts else "$." + ".".join(parts)
diff --git a/tests/fixtures/simple-document-schema.md b/tests/fixtures/simple-document-schema.md
new file mode 100644
index 0000000..ab77419
--- /dev/null
+++ b/tests/fixtures/simple-document-schema.md
@@ -0,0 +1,19 @@
+---
+version: "1.0.0"
+---
+
+# Simple Document Schema
+
+```json
+{
+  "$schema": "https://json-schema.org/draft/2020-12/schema",
+  "type": "object",
+  "required": ["headings"],
+  "properties": {
+    "headings": {
+      "type": "array",
+      "minItems": 1
+    }
+  }
+}
+```
diff --git a/tests/fixtures/valid-document.md b/tests/fixtures/valid-document.md
new file mode 100644
index 0000000..9d95ef0
--- /dev/null
+++ b/tests/fixtures/valid-document.md
@@ -0,0 +1,3 @@
+# Hello
+
+World.
diff --git a/tests/test_schema_contract.py b/tests/test_schema_contract.py
new file mode 100644
index 0000000..ea06b3f
--- /dev/null
+++ b/tests/test_schema_contract.py
@@ -0,0 +1,164 @@
+from pathlib import Path
+
+from click.testing import CliRunner
+
+from markitect_tool.cli import main
+from markitect_tool.schema import (
+    InvalidSchemaFormatError,
+    SchemaNotFoundError,
+    load_schema_file,
+    validate_markdown_file,
+    validate_schema,
+)
+
+
+SCHEMA_TEXT = """---
+schema-id: "https://example.test/schemas/document/v1"
+version: "1.0.0"
+status: "stable"
+---
+
+# Document Schema
+
+## Schema Definition
+
+```json
+{
+  "$schema": "https://json-schema.org/draft/2020-12/schema",
+  "title": "Document Schema",
+  "type": "object",
+  "required": ["frontmatter", "headings"],
+  "properties": {
+    "frontmatter": {
+      "type": "object",
+      "required": ["title"],
+      "properties": {
+        "title": {"type": "string"}
+      }
+    },
+    "headings": {
+      "type": "array",
+      "minItems": 1,
+      "items": {
+        "type": "object",
+        "required": ["level", "text"],
+        "properties": {
+          "level": {"type": "integer"},
+          "text": {"type": "string"}
+        }
+      }
+    }
+  }
+}
+```
+"""
+
+
+def test_load_schema_file_extracts_metadata_and_json_schema(tmp_path: Path):
+    schema_file = tmp_path / "document-schema.md"
+    schema_file.write_text(SCHEMA_TEXT, encoding="utf-8")
+
+    loaded = load_schema_file(schema_file)
+
+    assert loaded.metadata["schema-id"] == "https://example.test/schemas/document/v1"
+    assert loaded.metadata["status"] == "stable"
+    assert loaded.schema["title"] == "Document Schema"
+    assert loaded.schema["x-markitect-source"]["format"] == "markdown"
+    assert loaded.source_path == str(schema_file)
+
+
+def test_load_schema_file_requires_json_block(tmp_path: Path):
+    schema_file = tmp_path / "missing.md"
+    schema_file.write_text("# Missing\n\nNo schema.", encoding="utf-8")
+
+    try:
+        load_schema_file(schema_file)
+    except SchemaNotFoundError as exc:
+        assert "No JSON schema found" in str(exc)
+    else:
+        raise AssertionError("expected SchemaNotFoundError")
+
+
+def test_load_schema_file_rejects_invalid_json(tmp_path: Path):
+    schema_file = tmp_path / "invalid.md"
+    schema_file.write_text("```json\n{invalid json}\n```", encoding="utf-8")
+
+    try:
+        load_schema_file(schema_file)
+    except InvalidSchemaFormatError as exc:
+        assert "Invalid JSON schema block" in str(exc)
+    else:
+        raise AssertionError("expected InvalidSchemaFormatError")
+
+
+def test_validate_markdown_file_returns_valid_result(tmp_path: Path):
+    schema_file = tmp_path / "document-schema.md"
+    schema_file.write_text(SCHEMA_TEXT, encoding="utf-8")
+    markdown_file = tmp_path / "document.md"
+    markdown_file.write_text("---\ntitle: Example\n---\n\n# Example\n\nBody.", encoding="utf-8")
+
+    result = validate_markdown_file(markdown_file, schema_file)
+
+    assert result.valid is True
+    assert result.violations == []
+    assert result.document_path == str(markdown_file)
+    assert result.schema_path == str(schema_file)
+
+
+def test_validate_markdown_file_reports_violations(tmp_path: Path):
+    schema_file = tmp_path / "document-schema.md"
+    schema_file.write_text(SCHEMA_TEXT, encoding="utf-8")
+    markdown_file = tmp_path / "document.md"
+    markdown_file.write_text("# Missing Title\n\nBody.", encoding="utf-8")
+
+    result = validate_markdown_file(markdown_file, schema_file)
+
+    assert result.valid is False
+    assert result.violations
+    assert result.violations[0].path == "$.frontmatter"
+    assert "title" in result.violations[0].message
+
+
+def test_validate_schema_reports_invalid_schema():
+    result = validate_schema({"type": 7})
+
+    assert result.valid is False
+    assert result.violations
+
+
+def test_mkt_validate_exits_zero_for_valid_document(tmp_path: Path):
+    schema_file = tmp_path / "document-schema.md"
+    schema_file.write_text(SCHEMA_TEXT, encoding="utf-8")
+    markdown_file = tmp_path / "document.md"
+    markdown_file.write_text("---\ntitle: Example\n---\n\n# Example\n", encoding="utf-8")
+
+    result = CliRunner().invoke(
+        main, ["validate", str(markdown_file), "--schema", str(schema_file)]
+    )
+
+    assert result.exit_code == 0
+    assert "valid" in result.output
+
+
+def test_mkt_validate_exits_nonzero_for_invalid_document(tmp_path: Path):
+    schema_file = tmp_path / "document-schema.md"
+    schema_file.write_text(SCHEMA_TEXT, encoding="utf-8")
+    markdown_file = tmp_path / "document.md"
+    markdown_file.write_text("# Missing Title\n", encoding="utf-8")
+
+    result = CliRunner().invoke(
+        main, ["validate", str(markdown_file), "--schema", str(schema_file)]
+    )
+
+    assert result.exit_code == 1
+    assert "invalid" in result.output
+
+
+def test_mkt_schema_validate(tmp_path: Path):
+    schema_file = tmp_path / "document-schema.md"
+    schema_file.write_text(SCHEMA_TEXT, encoding="utf-8")
+
+    result = CliRunner().invoke(main, ["schema", "validate", str(schema_file)])
+
+    assert result.exit_code == 0
+    assert "valid" in result.output
diff --git a/workplans/MKTT-WP-0003-core-toolkit-implementation.md b/workplans/MKTT-WP-0003-core-toolkit-implementation.md
index da3a0f6..d949558 100644
--- a/workplans/MKTT-WP-0003-core-toolkit-implementation.md
+++ b/workplans/MKTT-WP-0003-core-toolkit-implementation.md
@@ -52,7 +52,7 @@ sections, content blocks, parser tokens, API access, and `mkt parse`.
 ```task
 id: MKTT-WP-0003-T003
-status: todo
+status: done
 priority: high
 state_hub_task_id: "36a22def-d415-4c08-a793-836ee52e4308"
 ```
@@ -60,6 +60,9 @@ state_hub_task_id: "36a22def-d415-4c08-a793-836ee52e4308"
 Implement FR-010 through FR-012: define/derive schemas, validate documents, and report structured violations with file/location context.
+Initial implementation complete for Markdown schema loading, JSON Schema
+validation, structured violations, `mkt validate`, and `mkt schema validate`.
+
 ## P3.4 - Implement query and extraction
 ```task
diff --git a/workplans/MKTT-WP-0004-practical-contract-framework.md b/workplans/MKTT-WP-0004-practical-contract-framework.md
new file mode 100644
index 0000000..30d1dce
--- /dev/null
+++ b/workplans/MKTT-WP-0004-practical-contract-framework.md
@@ -0,0 +1,154 @@
+---
+id: MKTT-WP-0004
+type: workplan
+title: "Practical Document Contract Framework"
+domain: markitect
+status: proposed
+owner: markitect-tool
+topic_slug: markitect
+created: "2026-05-03"
+updated: "2026-05-03"
+---
+
+# MKTT-WP-0004: Practical Document Contract Framework
+
+## Purpose
+
+Improve the practical utility of `markitect-tool` by moving beyond generic
+heading-count schema validation toward document contracts with section
+specifications, fields/forms, context-aware rules, metric bands, optional LLM
+assessments, and unified diagnostics.
+
+## Background
+
+Research and legacy comparison are captured in:
+
+- `docs/practical-schema-framework-research.md`
+- `docs/markitect-main-scope-assessment.md`
+- `docs/markitect-main-test-migration-inventory.md`
+
+## P4.1 - Define contract terminology and file format
+
+```task
+id: MKTT-WP-0004-T001
+status: todo
+priority: high
+```
+
+Define the first `DocumentContract` format in markdown/YAML:
+
+- document type
+- section specifications
+- field/form specifications
+- deterministic rules/assertions
+- metric bands
+- optional assessment rubrics
+- diagnostic metadata
+
+Keep it provider-neutral and readable by humans.
+
+## P4.2 - Implement unified diagnostic model
+
+```task
+id: MKTT-WP-0004-T002
+status: todo
+priority: high
+```
+
+Create diagnostics with severity, code, message, source location, contract
+location, rule id, and optional repair guidance. Use this model for JSON Schema
+violations and all new contract checks.
+
+## P4.3 - Implement section specifications
+
+```task
+id: MKTT-WP-0004-T003
+status: todo
+priority: high
+```
+
+Support required, recommended, optional, discouraged, and forbidden sections.
+Support aliases, expected heading level, section type, ordering constraints,
+and clear diagnostics.
+
+## P4.4 - Implement metric bands
+
+```task
+id: MKTT-WP-0004-T004
+status: todo
+priority: medium
+```
+
+Support document-level and section-level bands for words, characters,
+sentences, paragraphs, sections, list items, code blocks, and nesting depth.
+Allow soft warnings and hard errors.
+
+## P4.5 - Design form and context model
+
+```task
+id: MKTT-WP-0004-T005
+status: todo
+priority: medium
+```
+
+Specify fields, defaults, prefill sources, dynamic requiredness, conditional
+visibility, calculations, and validation against external context. This task is
+design-first; implementation can follow in a later workplan.
+
+## P4.6 - Design LLM assessment adapter contract
+
+```task
+id: MKTT-WP-0004-T006
+status: todo
+priority: medium
+```
+
+Define provider-neutral request/response models for section-level rubrics:
+criteria, inputs, context, score, pass/fail, reason, model metadata, and cache
+keys. Do not bind core logic to any provider.
+
+## P4.7 - Add practical CLI surface
+
+```task
+id: MKTT-WP-0004-T007
+status: todo
+priority: high
+```
+
+Add:
+
+```text
+mkt contract validate
+mkt contract check --contract
+mkt metrics
+```
+
+Ensure output is useful to humans and machines.
+
+## P4.8 - Build use-case examples
+
+```task
+id: MKTT-WP-0004-T008
+status: todo
+priority: medium
+```
+
+Create examples for:
+
+- ADR
+- PRD/FRS
+- workplan
+- personalized/business letter
+- concept note or entity profile
+
+Each example should include contract, valid document, invalid document, and
+expected diagnostics.
+
+## Decision Point
+
+This workplan should probably run before WP-0003 query/transform/cache work,
+because it changes what "validation" means and establishes the diagnostic model
+that later query/transform/generation features should reuse.
+
+If postponed, continue WP-0003 with query/extraction only if we commit to
+revisiting diagnostics and contract semantics before generation or LLM hooks.
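
Review note: the unified diagnostic model (P4.2) and the section presence levels (P4.3) that this workplan proposes could be prototyped roughly as below. This is only a sketch under stated assumptions. `Severity`, `Diagnostic`, `SectionSpec`, `check_sections`, the diagnostic codes, and the location string formats are all hypothetical names chosen for illustration; none of them exist in `markitect-tool` as far as this diff shows. The field list mirrors the one in P4.2, and the mapping from presence level to severity is one possible choice, not a decided design.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Severity(str, Enum):
    ERROR = "error"
    WARNING = "warning"


@dataclass(frozen=True)
class Diagnostic:
    """One finding. Fields follow the P4.2 list; the names are assumptions."""
    severity: Severity
    code: str
    message: str
    source_location: Optional[str] = None    # assumed format, e.g. "document.md"
    contract_location: Optional[str] = None  # assumed format, e.g. "sections[1]"
    rule_id: Optional[str] = None
    repair: Optional[str] = None


@dataclass(frozen=True)
class SectionSpec:
    """Hypothetical section specification covering part of P4.3."""
    title: str
    presence: str = "required"  # required | recommended | optional | discouraged | forbidden
    aliases: tuple = ()


# Assumed mapping: which presence levels complain when a section is
# missing, and which complain when it is present.
_WHEN_MISSING = {"required": Severity.ERROR, "recommended": Severity.WARNING}
_WHEN_PRESENT = {"forbidden": Severity.ERROR, "discouraged": Severity.WARNING}


def check_sections(headings, specs, source="document.md"):
    """Compare heading titles against specs and return a list of Diagnostic."""
    found = {h.strip().lower() for h in headings}
    diagnostics = []
    for index, spec in enumerate(specs):
        names = {spec.title.lower(), *(a.lower() for a in spec.aliases)}
        present = bool(names & found)
        if not present and spec.presence in _WHEN_MISSING:
            diagnostics.append(Diagnostic(
                severity=_WHEN_MISSING[spec.presence],
                code="section-missing",
                message=f"Section '{spec.title}' is {spec.presence} but was not found.",
                source_location=source,
                contract_location=f"sections[{index}]",
                rule_id=f"section.{spec.presence}",
                repair=f"Add a '{spec.title}' heading.",
            ))
        elif present and spec.presence in _WHEN_PRESENT:
            diagnostics.append(Diagnostic(
                severity=_WHEN_PRESENT[spec.presence],
                code="section-unexpected",
                message=f"Section '{spec.title}' is {spec.presence} but is present.",
                source_location=source,
                contract_location=f"sections[{index}]",
                rule_id=f"section.{spec.presence}",
                repair=f"Remove or rename the '{spec.title}' section.",
            ))
    return diagnostics
```

Keeping `contract_location` separate from `source_location` lets a diagnostic point at both the offending document and the rule that fired, which is what P4.2 asks for. Ordering constraints, heading levels, and section types are left out to keep the sketch short.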