Initial schemas and validation with extension workplan

2026-05-03 22:12:46 +02:00
parent b96b1fb745
commit 8c9129c371
15 changed files with 1025 additions and 2 deletions
--- a/docs/markitect-main-scope-assessment.md
+++ b/docs/markitect-main-scope-assessment.md
@@ -47,6 +47,17 @@ consumer needs them through the new library contract:
 4. Treat old code as reference material; do not preserve backward compatibility unless the new contract explicitly needs it.
 5. Keep database, platform, and domain lifecycle concerns out of this repo.
 ## Practicality Reassessment
 The first implementation slices intentionally rebuilt the clean parser and JSON
 Schema spine. That is necessary but not sufficient. The legacy project already
 showed that heading counts and raw structural schemas have limited practical
 utility.
 The successor should prioritize a document contract framework before going much
 deeper into generic tooling. See `docs/practical-schema-framework-research.md`
 and `workplans/MKTT-WP-0004-practical-contract-framework.md`.
 ## Initial Architecture Target
 ```text
--- a/docs/practical-schema-framework-research.md
+++ b/docs/practical-schema-framework-research.md
@@ -0,0 +1,323 @@
 # Practical Schema Framework Research
 Date: 2026-05-03
 ## Purpose
 This document reassesses `markitect-tool` schema utility before further
 implementation. The concern is that pure structural validation, such as heading
 counts and min/max depth constraints, is rarely enough to make markdown document
 pipelines useful.
 The practical opportunity is to define a stronger framework for markdown-native
 document contracts: section specifications, content assertions, form fields,
 context-aware rules, LLM-assisted assessments, and high-quality diagnostics.
 ## Research Signals
 ### Structured Authoring
 DITA is the strongest analogue for typed, reusable textual units. It emphasizes
 information typing, semantic markup, modularity, reuse, interchange, and
 multiple deliverables from one source. A DITA topic is the unit of authoring and
 reuse; topics may be generic or specialized into roles such as concept, task, or
 reference.
 Relevance for `markitect-tool`:
 - A markdown document or section should have an explicit information type.
 - Information type should imply expected structure and reader purpose.
 - Reuse and composition need stable addressing of sections, not only files.
 - Specialization is a better mental model than ad hoc schema forks.
 Sources:
 - https://dita-lang.org/dita/archspec/base/basic-concepts
 - https://dita-lang.org/dita/archspec/base/introduction-to-dita
 ### Document Schemas With Assertions
 DocBook remains relevant because it combines formal document schemas with
 Schematron-style assertions. That is the missing layer in many simplistic JSON
 Schema approaches: grammar says what may exist; assertions say what must be true
 in context.
 Relevance for `markitect-tool`:
 - JSON Schema over `Document.to_dict()` is useful but insufficient.
 - We need a second assertion layer for document-specific semantics.
 - Diagnostics must point to the document location and rule intention.
 Source:
 - https://docbook.org/schemas/docbook/
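
A minimal sketch of what a second assertion layer could look like, assuming the parsed document dict exposes `frontmatter` and `headings` entries with a `text` key. The `Assertion` class, `check_assertions`, the `status` frontmatter key, and the rule id are hypothetical names for illustration, not part of the current API.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Assertion:
    """Hypothetical contextual rule: an id, its intention, and a predicate."""

    rule_id: str
    intent: str
    holds: Callable[[dict[str, Any]], bool]


def heading_texts(document: dict[str, Any]) -> set[str]:
    """Collect heading texts from the parsed document dict (assumed shape)."""
    return {heading["text"] for heading in document.get("headings", [])}


def check_assertions(document: dict[str, Any], assertions: list[Assertion]) -> list[str]:
    """Return one message per failed assertion, naming the rule and its intention."""
    return [f"{a.rule_id}: {a.intent}" for a in assertions if not a.holds(document)]


# Grammar can say "headings is a non-empty array"; an assertion can say
# "if the ADR is accepted, a Decision section must exist".
ADR_ASSERTIONS = [
    Assertion(
        rule_id="adr-decision-present",
        intent="An accepted ADR must contain a Decision section",
        holds=lambda d: d.get("frontmatter", {}).get("status") != "accepted"
        or "Decision" in heading_texts(d),
    )
]
```

The point of the sketch is the split: shape checks stay in JSON Schema, while context-dependent claims carry their own rule id and intention so a diagnostic can report both.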
 ### Dynamic Form Rules
 JSON Schema supports conditional validation through `dependentRequired`,
 `dependentSchemas`, and `if`/`then`/`else`. JSON Forms separates data schema
 from UI schema and uses rules to show, hide, enable, or disable UI elements
 based on JSON Schema conditions. Form.io’s architecture treats the form schema
 as a single source of truth for validation and conditional logic across client
 and server.
 Relevance for `markitect-tool`:
 - Forms should be first-class, not bolted onto document generation.
 - Field definitions need static validation and dynamic rules.
 - Prefill, visibility, requiredness, and calculated values should come from the
  same contract used for generation and validation.
 - Context data must be explicit and typed.
 Sources:
 - https://json-schema.org/understanding-json-schema/reference/conditionals
 - https://jsonforms.io/docs/uischema/rules/
 - https://form.io/features/form-conditional-logic-form-validation/
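
To make the conditional idea concrete, here is a small standard-library sketch that mirrors the meaning of `dependentRequired` for flat objects only: when a trigger field is present, the listed fields must also be present. The function name `missing_dependents` and the field names are illustrative assumptions; a real implementation would more likely delegate to a JSON Schema validator.

```python
from __future__ import annotations

from typing import Any


def missing_dependents(
    data: dict[str, Any], dependent_required: dict[str, list[str]]
) -> dict[str, list[str]]:
    """For each trigger field present in data, list its absent dependent fields."""
    missing: dict[str, list[str]] = {}
    for trigger, required in dependent_required.items():
        if trigger in data:
            absent = [name for name in required if name not in data]
            if absent:
                missing[trigger] = absent
    return missing


# Hypothetical form rule: giving a credit card makes the billing address required.
RULES = {"credit_card": ["billing_address"]}
```

The same rule table could drive both validation and a form hint such as "show the billing address field once a credit card is entered", which is the single-source-of-truth property noted above.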
 ### LLM-Assisted Assessment
 Modern evaluation frameworks treat LLM assessment as explicit graders or
 rubrics. OpenAI graders return scores in a 0–1 range and can combine grader
 types. Promptfoo’s `llm-rubric` uses explicit criteria and expects structured
 judge output with reason, score, and pass/fail.
 Relevance for `markitect-tool`:
 - LLM checks should be declared as assessment rules, not hidden in prompts.
 - Deterministic validation and LLM assessment should produce one diagnostic
  model.
 - Section-level rubrics are more useful than whole-document vague grading.
 - The LLM provider must remain external; `markitect-tool` defines contracts and
  reports.
 Sources:
 - https://developers.openai.com/api/docs/guides/graders
 - https://www.promptfoo.dev/docs/configuration/expected-outputs/model-graded/llm-rubric/
 ### Markdown Structure
 CommonMark gives markdown a well-defined block/inline model. mdast gives a
 language-neutral tree vocabulary for Markdown nodes. Both point toward keeping
 the parse layer separate from domain/schema layers.
 Relevance for `markitect-tool`:
 - The core document model should stay close to CommonMark/mdast concepts.
 - Practical document contracts should sit above the parse model.
 - Section addressing, source spans, and block identity are foundational for good
  diagnostics.
 Sources:
 - https://spec.commonmark.org/0.31.2/
 - https://github.com/syntax-tree/mdast
 ## Terminology Proposal
 | Term | Meaning |
 | --- | --- |
 | Document | A markdown artifact parsed into frontmatter, blocks, headings, sections, and source spans. |
 | Section | A heading-led document region with content, children, source location, and stable identity. |
 | Document Type | A named contract for a whole document, e.g. ADR, PRD, invoice letter, support reply, concept note. |
 | Section Type | A reusable role for a section, e.g. Context, Decision, Risks, Procedure, Evidence, Conclusion. |
 | Field | A typed value expected in frontmatter, inline matter, a section, or an external data record. |
 | Form | A field collection with UI hints, validation rules, defaults, dynamic visibility, and calculations. |
 | Context | External data available during validation/generation, such as user data, project data, dates, or related entities. |
 | Rule | A deterministic condition evaluated against document, fields, context, or pipeline state. |
 | Assertion | A claim that must hold for content, usually richer than shape validation. |
 | Metric Band | A soft or hard target for size/complexity, such as word count, sentence count, section count, or reading level. |
 | Assessment | A deterministic or LLM-assisted evaluation that returns pass/fail, score, reason, and diagnostics. |
 | Rubric | A human-readable criterion for LLM-assisted assessment, scoped to a document or section type. |
 | Diagnostic | A structured finding with severity, code, message, source location, rule id, and suggested repair. |
 | Contract | The full specification for a document type: structure, sections, fields, rules, forms, assertions, rubrics, and outputs. |
 | Pipeline | A repeatable sequence of parse, prefill, generate, validate, assess, transform, and compose operations. |
 ## Most Relevant Use Cases
 ### UC-001: Typed Document Contract
 Define a document type such as ADR, PRD, FRS, workplan, customer letter, or
 meeting brief. Specify required sections by semantic role, allowed alternatives,
 field requirements, and diagnostics.
 Practical value:
 - Prevents missing critical content.
 - Makes generated documents predictable.
 - Creates an explicit contract for humans and agents.
 Needed tooling:
 - `mkt contract check <doc> --contract <contract.md>`
 - Section matching by heading text, aliases, ids, or section type markers.
 - Diagnostics that say which section/field/assertion failed and why.
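
Section matching by heading text or alias could be sketched as follows. `SectionSpec`, `match_sections`, and the case-insensitive normalization are assumptions made for illustration; matching by ids or section type markers is left out.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SectionSpec:
    """Hypothetical section specification: a semantic role plus accepted headings."""

    section_type: str
    heading: str
    aliases: tuple[str, ...] = ()
    required: bool = True


def _normalize(text: str) -> str:
    """Compare headings case-insensitively and ignore repeated whitespace."""
    return " ".join(text.casefold().split())


def match_sections(
    headings: list[str], specs: list[SectionSpec]
) -> tuple[dict[str, str], list[str]]:
    """Return (section_type -> matched heading, missing required section types)."""
    by_norm = {_normalize(h): h for h in headings}
    matched: dict[str, str] = {}
    missing: list[str] = []
    for spec in specs:
        names = (spec.heading, *spec.aliases)
        hit = next((by_norm[_normalize(n)] for n in names if _normalize(n) in by_norm), None)
        if hit is not None:
            matched[spec.section_type] = hit
        elif spec.required:
            missing.append(spec.section_type)
    return matched, missing
```

Returning the missing section types separately keeps the matcher free of message formatting, so a diagnostic layer can decide how to report each gap.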
 ### UC-002: Section-Level Content Expectations
 Specify what a section is expected to contain: assertions, required evidence,
 forbidden omissions, content patterns, examples, and reviewer prompts.
 Practical value:
 - Moves beyond “has a heading” toward “does the section do its job?”
 - Enables review of generated or human-authored text.
 Needed tooling:
 - Deterministic assertions for regex, presence, references, counts, and field
  values.
 - Optional LLM rubrics for semantic content checks.
 - Per-section diagnostic reports.
 ### UC-003: Size and Complexity Bands
 Define soft/hard bands for document and section size: words, characters,
 sentences, paragraphs, sections, list items, code blocks, and nesting depth.
 Practical value:
 - Controls generation output size.
 - Keeps templates from becoming bloated or underdeveloped.
 - Helps compare intended vs actual document complexity.
 Needed tooling:
 - Metrics extractor.
 - Rule severities: info, warning, error.
 - “Too small/too large” diagnostics with actual and target values.
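
A rough sketch of a soft/hard metric band check, assuming whitespace-separated word counting is good enough for a first pass. `MetricBand`, `check_band`, and the message format are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class MetricBand:
    """Hypothetical band: leaving the soft range warns, leaving the hard range errors."""

    metric: str
    soft_min: int
    soft_max: int
    hard_min: int
    hard_max: int


def word_count(text: str) -> int:
    """Count whitespace-separated tokens; a deliberately naive metric."""
    return len(text.split())


def check_band(value: int, band: MetricBand) -> str | None:
    """Return None inside the soft band, else a message with actual and target values."""
    if value < band.hard_min or value > band.hard_max:
        severity, low, high = "error", band.hard_min, band.hard_max
    elif value < band.soft_min or value > band.soft_max:
        severity, low, high = "warning", band.soft_min, band.soft_max
    else:
        return None
    direction = "too small" if value < low else "too large"
    return f"{severity}: {band.metric} {direction}: actual {value}, target {low}-{high}"
```

The hard range is checked first so that a value outside both ranges is reported once, at the higher severity.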
 ### UC-004: Form-Backed Markdown Generation
 Define forms that collect or prefill structured fields, then render markdown
 documents. Fields may be static, calculated, conditional, or context-derived.
 Practical value:
 - Bridges structured data capture and prose generation.
 - Supports repeatable business documents.
 - Makes prefill from user/project/entity data explicit.
 Needed tooling:
 - Field schema.
 - UI schema or form hints.
 - Dynamic rules for requiredness, visibility, defaults, and calculations.
 - Template rendering with validation before and after render.
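
As an illustration of validating before render, the sketch below uses the standard library `string.Template`. The function name `render_document`, the field names, and the template text are assumptions; validation after render and dynamic rules are not shown.

```python
from __future__ import annotations

from string import Template
from typing import Any


def render_document(template_text: str, fields: dict[str, Any], required: list[str]) -> str:
    """Check required fields first, then substitute them into the markdown template."""
    missing = sorted(name for name in required if not fields.get(name))
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    # Template.substitute raises KeyError for placeholders that have no value.
    return Template(template_text).substitute(fields)


# Hypothetical letter template with two fields.
LETTER = "# Letter to $name\n\nPayment is due on $due_date.\n"
```

Failing before rendering means a partially filled document is never produced, which keeps later validation focused on content rather than on missing inputs.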
 ### UC-005: Context-Aware Validation
 Validate a document against external context: user data, project metadata,
 related entities, dates, policy constraints, or canonical terminology.
 Practical value:
 - Checks whether a document is correct for this case, not only generally
  well-formed.
 - Enables pipelines like personalized letters, compliance reports, and
  project-specific workplans.
 Needed tooling:
 - Context object schema.
 - Resolvers for local files, JSON/YAML data, and later higher-layer systems.
 - Rule expressions that can reference document and context paths.
 ### UC-006: LLM-Assisted Section Assessment
 Attach rubrics to section types. Use an external LLM adapter to assess whether a
 section satisfies the rubric, returning score, reason, and pass/fail.
 Practical value:
 - Handles semantic checks that deterministic rules cannot.
 - Supports review loops for generated text.
 - Makes subjective requirements explicit and auditable.
 Needed tooling:
 - Rubric declaration format.
 - Provider-neutral assessment request/response models.
 - Caching and reproducibility metadata.
 - Clear distinction between deterministic errors and model-judged findings.
 ### UC-007: Pipeline Diagnostics and Repair Guidance
 Run a document pipeline and get one coherent diagnostic report from parsing,
 schema checks, field validation, assertions, generation, composition, and
 LLM-assisted assessments.
 Practical value:
 - Makes failures debuggable.
 - Helps humans and agents repair documents.
 - Avoids scattered errors from unrelated subsystems.
 Needed tooling:
 - Common diagnostic model.
 - Error codes and severities.
 - Source spans and rule ids.
 - Suggested repair text or structured patches when safe.
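
A common diagnostic model might look roughly like this. The `Diagnostic` fields follow the list above, but the class, the severity ordering, and `build_report` are assumptions for illustration, and the source span is reduced to a single line number.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass
from typing import Any

# Lower number sorts first, so errors lead the report.
SEVERITY_ORDER = {"error": 0, "warning": 1, "info": 2}


@dataclass(frozen=True)
class Diagnostic:
    """Hypothetical unified finding shared by all pipeline stages."""

    severity: str
    code: str
    message: str
    rule_id: str
    line: int | None = None
    suggested_repair: str | None = None

    def to_dict(self) -> dict[str, Any]:
        # Drop unset optional values to keep the serialized report compact.
        return {key: value for key, value in asdict(self).items() if value is not None}


def build_report(diagnostics: list[Diagnostic]) -> dict[str, Any]:
    """Merge findings from any stage into one report, ordered by severity then line."""
    ordered = sorted(diagnostics, key=lambda d: (SEVERITY_ORDER[d.severity], d.line or 0))
    return {
        "valid": not any(d.severity == "error" for d in ordered),
        "diagnostics": [d.to_dict() for d in ordered],
    }
```

Because every stage emits the same record type, a schema violation and a metric band warning can appear in one list without per-subsystem formatting.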
 ## Comparison With markitect-main
 `markitect-main` had several useful seeds:
 - `x-markitect-sections` for required/recommended/optional/discouraged/improper sections.
 - `x-markitect-content-control` for required, discouraged, and forbidden patterns plus word-count metrics.
 - Section and content validators with warnings/errors.
 - Schema generation and validation experiments.
 - Draft generation with `x-markitect-field-mapping`.
 - Prompt quality gates with schema and pattern validators.
 - Infospace entity parsing and LLM classification/evaluation.
 The problem was not lack of ideas. The problem was that the ideas lived in
 separate subsystems with different models:
 - Schema validation compared generated schemas rather than validating a stable
  document contract.
 - Semantic validation used `x-markitect-*` extensions but was not integrated
  into a unified contract framework.
 - Field mapping existed in draft generation, not in a general form/context
  model.
 - LLM quality gates existed inside prompt execution, not as provider-neutral
  document assessments.
 - Infospace checks were domain/application layer behavior, not syntax-layer
  primitives.
 ## Strategic Direction
 The successor should introduce a framework layer above parsing:
 ```text
 Markdown parse model
  -> document contract
      -> section specifications
      -> field/form specifications
      -> deterministic rules/assertions
      -> metric bands
      -> optional LLM rubrics
      -> unified diagnostics
 ```
 This should not replace JSON Schema. JSON Schema remains useful for typed data
 and machine validation. The new layer should make document-specific semantics
 natural.
 ## Recommendation
 Do not continue straight into generic query/transform work until this framework
 direction is captured. The next implementation slice should be a small,
 deterministic version of document contracts:
 1. Define the contract schema and terminology.
 2. Implement section specifications.
 3. Implement metric bands.
 4. Implement the unified diagnostic model.
 5. Leave LLM rubrics and form dynamics as designed extension points for the next
   slice.
 This is the utility inflection point. It will make `markitect-tool` practically
 useful instead of merely structurally correct.
--- a/docs/state-hub-integration.md
+++ b/docs/state-hub-integration.md
@@ -34,7 +34,7 @@ workplans/
 SBOM source: `sbom-tools.yaml`.
-Initial SBOM ingest succeeded on 2026-05-03 with seven declared entries for the
+Initial SBOM ingest succeeded on 2026-05-03 with eight declared entries for the
 core and optional dependencies.
 ## Registered Extension Points
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -11,6 +11,7 @@ requires-python = ">=3.12"
 license = { text = "MIT" }
 dependencies = [
    "click>=8.0",
    "jsonschema>=4.0",
    "markdown-it-py",
    "PyYAML",
 ]
--- a/sbom-tools.yaml
+++ b/sbom-tools.yaml
@@ -7,6 +7,10 @@ tools:
    ecosystem: python
    is_direct: true
    is_dev: false
  - name: jsonschema
    ecosystem: python
    is_direct: true
    is_dev: false
  - name: PyYAML
    ecosystem: python
    is_direct: true
--- a/src/markitect_tool/__init__.py
+++ b/src/markitect_tool/__init__.py
@@ -9,6 +9,14 @@ from markitect_tool.core import (
    parse_markdown,
    parse_markdown_file,
 )
 from markitect_tool.schema import (
    MarkdownSchema,
    SchemaValidationResult,
    ValidationViolation,
    load_schema_file,
    validate_document,
    validate_markdown_file,
 )
 __all__ = [
    "ContentBlock",
@@ -18,4 +26,10 @@ __all__ = [
    "Section",
    "parse_markdown",
    "parse_markdown_file",
    "MarkdownSchema",
    "SchemaValidationResult",
    "ValidationViolation",
    "load_schema_file",
    "validate_document",
    "validate_markdown_file",
 ]
--- a/src/markitect_tool/cli/main.py
+++ b/src/markitect_tool/cli/main.py
@@ -9,6 +9,7 @@ import click
 import yaml
 from markitect_tool.core import parse_markdown_file
 from markitect_tool.schema import load_schema_file, validate_markdown_file, validate_schema
@click.group()
@@ -40,5 +41,66 @@ def parse(file: Path, output_format: str) -> None:
        click.echo(json.dumps(data, indent=2, ensure_ascii=False))
@main.command()
@click.argument("file", type=click.Path(exists=True, dir_okay=False, path_type=Path))
@click.option(
    "--schema",
    "schema_file",
    required=True,
    type=click.Path(exists=True, dir_okay=False, path_type=Path),
 )
@click.option(
    "--format",
    "output_format",
    type=click.Choice(["json", "yaml", "text"], case_sensitive=False),
    default="text",
    show_default=True,
 )
 def validate(file: Path, schema_file: Path, output_format: str) -> None:
    """Validate a Markdown file against a Markdown schema file."""
    result = validate_markdown_file(file, schema_file)
    _emit_result(result.to_dict(), output_format)
    raise click.exceptions.Exit(0 if result.valid else 1)
@main.group()
 def schema() -> None:
    """Work with Markdown schema files."""
@schema.command("validate")
@click.argument("schema_file", type=click.Path(exists=True, dir_okay=False, path_type=Path))
@click.option(
    "--format",
    "output_format",
    type=click.Choice(["json", "yaml", "text"], case_sensitive=False),
    default="text",
    show_default=True,
 )
 def schema_validate(schema_file: Path, output_format: str) -> None:
    """Validate that a Markdown schema contains a well-formed JSON Schema."""
    loaded = load_schema_file(schema_file)
    result = validate_schema(loaded.schema)
    data = result.to_dict() | {"schema_path": str(schema_file)}
    _emit_result(data, output_format)
    raise click.exceptions.Exit(0 if result.valid else 1)
 def _emit_result(data: dict, output_format: str) -> None:
    if output_format == "json":
        click.echo(json.dumps(data, indent=2, ensure_ascii=False))
    elif output_format == "yaml":
        click.echo(yaml.safe_dump(data, sort_keys=False))
    else:
        if data.get("valid"):
            click.echo("valid")
        else:
            click.echo("invalid")
            for violation in data.get("violations", []):
                click.echo(f"- {violation['path']}: {violation['message']}")
 if __name__ == "__main__":
    main()
--- a/src/markitect_tool/schema/__init__.py
+++ b/src/markitect_tool/schema/__init__.py
@@ -0,0 +1,31 @@
 """Schema loading and validation for structured Markdown documents."""
 from markitect_tool.schema.loader import (
    InvalidSchemaFormatError,
    MarkdownSchema,
    SchemaLoaderError,
    SchemaNotFoundError,
    load_schema_file,
    load_schema_text,
 )
 from markitect_tool.schema.validator import (
    SchemaValidationResult,
    ValidationViolation,
    validate_document,
    validate_markdown_file,
    validate_schema,
 )
 __all__ = [
    "InvalidSchemaFormatError",
    "MarkdownSchema",
    "SchemaLoaderError",
    "SchemaNotFoundError",
    "SchemaValidationResult",
    "ValidationViolation",
    "load_schema_file",
    "load_schema_text",
    "validate_document",
    "validate_markdown_file",
    "validate_schema",
 ]
--- a/src/markitect_tool/schema/loader.py
+++ b/src/markitect_tool/schema/loader.py
@@ -0,0 +1,124 @@
 """Load JSON Schema definitions embedded in Markdown schema files."""
 from __future__ import annotations
 import json
 import re
 from dataclasses import dataclass
 from pathlib import Path
 from typing import Any
 import yaml
 class SchemaLoaderError(ValueError):
    """Base error raised for schema loading failures."""
 class SchemaNotFoundError(SchemaLoaderError):
    """Raised when no JSON schema block can be found."""
 class InvalidSchemaFormatError(SchemaLoaderError):
    """Raised when a schema block exists but is not valid JSON object data."""
@dataclass(frozen=True)
 class MarkdownSchema:
    """A JSON Schema loaded from a Markdown schema document."""
    schema: dict[str, Any]
    metadata: dict[str, Any]
    documentation: str
    source_path: str | None = None
    def to_dict(self) -> dict[str, Any]:
        data = {
            "schema": self.schema,
            "metadata": self.metadata,
            "documentation": self.documentation,
            "source_path": self.source_path,
        }
        return {key: value for key, value in data.items() if value is not None}
 _JSON_BLOCK_RE = re.compile(r"```json\s*(.*?)```", re.DOTALL | re.IGNORECASE)
 def load_schema_file(path: str | Path) -> MarkdownSchema:
    """Load a Markdown schema file."""
    schema_path = Path(path)
    if not schema_path.exists():
        raise FileNotFoundError(f"Schema file not found: {schema_path}")
    return load_schema_text(schema_path.read_text(encoding="utf-8"), source_path=str(schema_path))
 def load_schema_text(text: str, source_path: str | None = None) -> MarkdownSchema:
    """Load a Markdown schema document from text."""
    metadata, documentation = _split_frontmatter(text)
    schema = _extract_json_schema(documentation)
    schema = dict(schema)
    schema.setdefault(
        "x-markitect-source",
        {
            "format": "markdown",
            "file": source_path,
            "frontmatter": metadata,
        },
    )
    return MarkdownSchema(
        schema=schema,
        metadata=metadata,
        documentation=documentation,
        source_path=source_path,
    )
 def _split_frontmatter(text: str) -> tuple[dict[str, Any], str]:
    if not text.startswith("---\n"):
        return {}, text
    # Search from index 3 so that empty frontmatter ("---\n---\n") is also matched.
    end = text.find("\n---", 3)
    if end == -1:
        return {}, text
    closing_end = text.find("\n", end + 4)
    if closing_end == -1:
        closing_end = len(text)
    else:
        closing_end += 1
    raw = text[4:end]
    try:
        metadata = yaml.safe_load(raw) if raw.strip() else {}
    except yaml.YAMLError as exc:
        raise InvalidSchemaFormatError(f"Invalid schema frontmatter: {exc}") from exc
    if metadata is None:
        metadata = {}
    if not isinstance(metadata, dict):
        raise InvalidSchemaFormatError("Schema frontmatter must be a mapping")
    return metadata, text[closing_end:]
 def _extract_json_schema(text: str) -> dict[str, Any]:
    candidates = list(_JSON_BLOCK_RE.finditer(text))
    if not candidates:
        raise SchemaNotFoundError("No JSON schema found in markdown schema")
    parsed_blocks: list[dict[str, Any]] = []
    for match in candidates:
        raw_json = match.group(1).strip()
        try:
            data = json.loads(raw_json)
        except json.JSONDecodeError as exc:
            raise InvalidSchemaFormatError(f"Invalid JSON schema block: {exc}") from exc
        if not isinstance(data, dict):
            raise InvalidSchemaFormatError("JSON schema block must contain an object")
        parsed_blocks.append(data)
    for data in parsed_blocks:
        if "$schema" in data or "type" in data:
            return data
    return parsed_blocks[0]
--- a/src/markitect_tool/schema/validator.py
+++ b/src/markitect_tool/schema/validator.py
@@ -0,0 +1,110 @@
 """Validate parsed Markdown documents against JSON Schema."""
 from __future__ import annotations
 from dataclasses import asdict, dataclass
 from pathlib import Path
 from typing import Any
 from jsonschema import Draft202012Validator, SchemaError, ValidationError
 from markitect_tool.core import Document, parse_markdown_file
 from markitect_tool.schema.loader import MarkdownSchema, load_schema_file
@dataclass(frozen=True)
 class ValidationViolation:
    """A single schema validation violation."""
    path: str
    message: str
    schema_path: str
    def to_dict(self) -> dict[str, str]:
        return asdict(self)
@dataclass(frozen=True)
 class SchemaValidationResult:
    """Validation result for one document and one schema."""
    valid: bool
    violations: list[ValidationViolation]
    document_path: str | None = None
    schema_path: str | None = None
    def to_dict(self) -> dict[str, Any]:
        data = {
            "valid": self.valid,
            "violations": [violation.to_dict() for violation in self.violations],
            "document_path": self.document_path,
            "schema_path": self.schema_path,
        }
        return {key: value for key, value in data.items() if value is not None}
 def validate_schema(schema: dict[str, Any]) -> SchemaValidationResult:
    """Validate that a JSON Schema itself is well formed."""
    try:
        Draft202012Validator.check_schema(schema)
    except SchemaError as exc:
        return SchemaValidationResult(
            valid=False,
            violations=[
                ValidationViolation(
                    path=_format_path(exc.path),
                    message=exc.message,
                    schema_path=_format_path(exc.schema_path),
                )
            ],
        )
    return SchemaValidationResult(valid=True, violations=[])
 def validate_markdown_file(
    markdown_path: str | Path, schema_path: str | Path
 ) -> SchemaValidationResult:
    """Parse and validate a Markdown file against a Markdown schema file."""
    document = parse_markdown_file(markdown_path)
    loaded_schema = load_schema_file(schema_path)
    return validate_document(document, loaded_schema)
 def validate_document(
    document: Document, schema: MarkdownSchema | dict[str, Any]
 ) -> SchemaValidationResult:
    """Validate a parsed document against a loaded or raw JSON Schema."""
    raw_schema = schema.schema if isinstance(schema, MarkdownSchema) else schema
    schema_path = schema.source_path if isinstance(schema, MarkdownSchema) else None
    schema_check = validate_schema(raw_schema)
    if not schema_check.valid:
        return SchemaValidationResult(
            valid=False,
            violations=schema_check.violations,
            document_path=document.source_path,
            schema_path=schema_path,
        )
    validator = Draft202012Validator(raw_schema)
    violations = [
        ValidationViolation(
            path=_format_path(error.path),
            message=error.message,
            schema_path=_format_path(error.schema_path),
        )
        for error in sorted(validator.iter_errors(document.to_dict()), key=str)
    ]
    return SchemaValidationResult(
        valid=not violations,
        violations=violations,
        document_path=document.source_path,
        schema_path=schema_path,
    )
 def _format_path(path: Any) -> str:
    parts = [str(part) for part in path]
    return "$" if not parts else "$." + ".".join(parts)
--- a/tests/fixtures/simple-document-schema.md
+++ b/tests/fixtures/simple-document-schema.md
@@ -0,0 +1,19 @@
 ---
 version: "1.0.0"
 ---
 # Simple Document Schema
 ```json
 {
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "required": ["headings"],
  "properties": {
    "headings": {
      "type": "array",
      "minItems": 1
    }
  }
 }
 ```
--- a/tests/fixtures/valid-document.md
+++ b/tests/fixtures/valid-document.md
@@ -0,0 +1,3 @@
 # Hello
 World.
--- a/tests/test_schema_contract.py
+++ b/tests/test_schema_contract.py
@@ -0,0 +1,164 @@
 from pathlib import Path
 from click.testing import CliRunner
 from markitect_tool.cli import main
 from markitect_tool.schema import (
    InvalidSchemaFormatError,
    SchemaNotFoundError,
    load_schema_file,
    validate_markdown_file,
    validate_schema,
 )
 SCHEMA_TEXT = """---
 schema-id: "https://example.test/schemas/document/v1"
 version: "1.0.0"
 status: "stable"
 ---
 # Document Schema
 ## Schema Definition
 ```json
 {
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "Document Schema",
  "type": "object",
  "required": ["frontmatter", "headings"],
  "properties": {
    "frontmatter": {
      "type": "object",
      "required": ["title"],
      "properties": {
        "title": {"type": "string"}
      }
    },
    "headings": {
      "type": "array",
      "minItems": 1,
      "items": {
        "type": "object",
        "required": ["level", "text"],
        "properties": {
          "level": {"type": "integer"},
          "text": {"type": "string"}
        }
      }
    }
  }
 }
 ```
 """
 def test_load_schema_file_extracts_metadata_and_json_schema(tmp_path: Path):
    schema_file = tmp_path / "document-schema.md"
    schema_file.write_text(SCHEMA_TEXT, encoding="utf-8")
    loaded = load_schema_file(schema_file)
    assert loaded.metadata["schema-id"] == "https://example.test/schemas/document/v1"
    assert loaded.metadata["status"] == "stable"
    assert loaded.schema["title"] == "Document Schema"
    assert loaded.schema["x-markitect-source"]["format"] == "markdown"
    assert loaded.source_path == str(schema_file)
 def test_load_schema_file_requires_json_block(tmp_path: Path):
    schema_file = tmp_path / "missing.md"
    schema_file.write_text("# Missing\n\nNo schema.", encoding="utf-8")
    try:
        load_schema_file(schema_file)
    except SchemaNotFoundError as exc:
        assert "No JSON schema found" in str(exc)
    else:
        raise AssertionError("expected SchemaNotFoundError")
 def test_load_schema_file_rejects_invalid_json(tmp_path: Path):
    schema_file = tmp_path / "invalid.md"
    schema_file.write_text("```json\n{invalid json}\n```", encoding="utf-8")
    try:
        load_schema_file(schema_file)
    except InvalidSchemaFormatError as exc:
        assert "Invalid JSON schema block" in str(exc)
    else:
        raise AssertionError("expected InvalidSchemaFormatError")
 def test_validate_markdown_file_returns_valid_result(tmp_path: Path):
    schema_file = tmp_path / "document-schema.md"
    schema_file.write_text(SCHEMA_TEXT, encoding="utf-8")
    markdown_file = tmp_path / "document.md"
    markdown_file.write_text("---\ntitle: Example\n---\n\n# Example\n\nBody.", encoding="utf-8")
    result = validate_markdown_file(markdown_file, schema_file)
    assert result.valid is True
    assert result.violations == []
    assert result.document_path == str(markdown_file)
    assert result.schema_path == str(schema_file)
 def test_validate_markdown_file_reports_violations(tmp_path: Path):
    schema_file = tmp_path / "document-schema.md"
    schema_file.write_text(SCHEMA_TEXT, encoding="utf-8")
    markdown_file = tmp_path / "document.md"
    markdown_file.write_text("# Missing Title\n\nBody.", encoding="utf-8")
    result = validate_markdown_file(markdown_file, schema_file)
    assert result.valid is False
    assert result.violations
    assert result.violations[0].path == "$.frontmatter"
    assert "title" in result.violations[0].message
 def test_validate_schema_reports_invalid_schema():
    result = validate_schema({"type": 7})
    assert result.valid is False
    assert result.violations
 def test_mkt_validate_exits_zero_for_valid_document(tmp_path: Path):
    schema_file = tmp_path / "document-schema.md"
    schema_file.write_text(SCHEMA_TEXT, encoding="utf-8")
    markdown_file = tmp_path / "document.md"
    markdown_file.write_text("---\ntitle: Example\n---\n\n# Example\n", encoding="utf-8")
    result = CliRunner().invoke(
        main, ["validate", str(markdown_file), "--schema", str(schema_file)]
    )
    assert result.exit_code == 0
    assert "valid" in result.output
 def test_mkt_validate_exits_nonzero_for_invalid_document(tmp_path: Path):
    schema_file = tmp_path / "document-schema.md"
    schema_file.write_text(SCHEMA_TEXT, encoding="utf-8")
    markdown_file = tmp_path / "document.md"
    markdown_file.write_text("# Missing Title\n", encoding="utf-8")
    result = CliRunner().invoke(
        main, ["validate", str(markdown_file), "--schema", str(schema_file)]
    )
    assert result.exit_code == 1
    assert "invalid" in result.output
 def test_mkt_schema_validate(tmp_path: Path):
    schema_file = tmp_path / "document-schema.md"
    schema_file.write_text(SCHEMA_TEXT, encoding="utf-8")
    result = CliRunner().invoke(main, ["schema", "validate", str(schema_file)])
    assert result.exit_code == 0
    assert "valid" in result.output
--- a/workplans/MKTT-WP-0003-core-toolkit-implementation.md
+++ b/workplans/MKTT-WP-0003-core-toolkit-implementation.md
@@ -52,7 +52,7 @@ sections, content blocks, parser tokens, API access, and `mkt parse`.
 ```task
 id: MKTT-WP-0003-T003
-status: todo
+status: done
 priority: high
 state_hub_task_id: "36a22def-d415-4c08-a793-836ee52e4308"
 ```
@@ -60,6 +60,9 @@ state_hub_task_id: "36a22def-d415-4c08-a793-836ee52e4308"
 Implement FR-010 through FR-012: define/derive schemas, validate documents,
 and report structured violations with file/location context.
 Initial implementation complete for Markdown schema loading, JSON Schema
 validation, structured violations, `mkt validate`, and `mkt schema validate`.
 ## P3.4 - Implement query and extraction
 ```task
--- a/workplans/MKTT-WP-0004-practical-contract-framework.md
+++ b/workplans/MKTT-WP-0004-practical-contract-framework.md
@@ -0,0 +1,154 @@
 ---
 id: MKTT-WP-0004
 type: workplan
 title: "Practical Document Contract Framework"
 domain: markitect
 status: proposed
 owner: markitect-tool
 topic_slug: markitect
 created: "2026-05-03"
 updated: "2026-05-03"
 ---
 # MKTT-WP-0004: Practical Document Contract Framework
 ## Purpose
 Improve the practical utility of `markitect-tool` by moving beyond generic
 heading-count schema validation toward document contracts with section
 specifications, fields/forms, context-aware rules, metric bands, optional LLM
 assessments, and unified diagnostics.
 ## Background
 Research and legacy comparison are captured in:
 - `docs/practical-schema-framework-research.md`
 - `docs/markitect-main-scope-assessment.md`
 - `docs/markitect-main-test-migration-inventory.md`
 ## P4.1 - Define contract terminology and file format
 ```task
 id: MKTT-WP-0004-T001
 status: todo
 priority: high
 ```
 Define the first `DocumentContract` format in markdown/YAML:
 - document type
 - section specifications
 - field/form specifications
 - deterministic rules/assertions
 - metric bands
 - optional assessment rubrics
 - diagnostic metadata
 Keep it provider-neutral and readable by humans.
 ## P4.2 - Implement unified diagnostic model
 ```task
 id: MKTT-WP-0004-T002
 status: todo
 priority: high
 ```
 Create diagnostics with severity, code, message, source location, contract
 location, rule id, and optional repair guidance. Use this model for JSON Schema
 violations and all new contract checks.
 ## P4.3 - Implement section specifications
 ```task
 id: MKTT-WP-0004-T003
 status: todo
 priority: high
 ```
 Support required, recommended, optional, discouraged, and forbidden sections.
 Support aliases, expected heading level, section type, ordering constraints,
 and clear diagnostics.
 ## P4.4 - Implement metric bands
 ```task
 id: MKTT-WP-0004-T004
 status: todo
 priority: medium
 ```
 Support document-level and section-level bands for words, characters,
 sentences, paragraphs, sections, list items, code blocks, and nesting depth.
 Allow soft warnings and hard errors.
 ## P4.5 - Design form and context model
 ```task
 id: MKTT-WP-0004-T005
 status: todo
 priority: medium
 ```
 Specify fields, defaults, prefill sources, dynamic requiredness, conditional
 visibility, calculations, and validation against external context. This task is
 design-first; implementation can follow in a later workplan.
 ## P4.6 - Design LLM assessment adapter contract
 ```task
 id: MKTT-WP-0004-T006
 status: todo
 priority: medium
 ```
 Define provider-neutral request/response models for section-level rubrics:
 criteria, inputs, context, score, pass/fail, reason, model metadata, and cache
 keys. Do not bind core logic to any provider.
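
A provider-neutral request/response pair could be sketched as below. All names (`AssessmentRequest`, `AssessmentResult`, `cache_key`, the default threshold of 0.5) are assumptions for illustration; no provider call is made, and the score range follows the 0 to 1 convention noted in the research document.

```python
from __future__ import annotations

import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class AssessmentRequest:
    """Hypothetical request: everything an external adapter needs to judge a section."""

    rubric_id: str
    criteria: str
    section_text: str
    model: str

    def cache_key(self) -> str:
        """Stable SHA-256 digest over all inputs, so identical requests can be reused."""
        payload = json.dumps(
            {
                "rubric_id": self.rubric_id,
                "criteria": self.criteria,
                "section_text": self.section_text,
                "model": self.model,
            },
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class AssessmentResult:
    """Hypothetical response: a score between 0 and 1, a reason, and a pass threshold."""

    score: float
    reason: str
    threshold: float = 0.5

    def __post_init__(self) -> None:
        if not 0.0 <= self.score <= 1.0:
            raise ValueError("score must be between 0 and 1")

    @property
    def passed(self) -> bool:
        return self.score >= self.threshold
```

Including the model name in the cache key means a cached judgment is never reused for a different model, which supports the reproducibility metadata mentioned above.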
 ## P4.7 - Add practical CLI surface
 ```task
 id: MKTT-WP-0004-T007
 status: todo
 priority: high
 ```
 Add:
 ```text
 mkt contract validate <contract.md>
 mkt contract check <document.md> --contract <contract.md>
 mkt metrics <document.md>
 ```
 Ensure output is useful to humans and machines.
 ## P4.8 - Build use-case examples
 ```task
 id: MKTT-WP-0004-T008
 status: todo
 priority: medium
 ```
 Create examples for:
 - ADR
 - PRD/FRS
 - workplan
 - personalized/business letter
 - concept note or entity profile
 Each example should include contract, valid document, invalid document, and
 expected diagnostics.
 ## Decision Point
 This workplan should probably run before WP-0003 query/transform/cache work,
 because it changes what "validation" means and establishes the diagnostic model
 that later query/transform/generation features should reuse.
 If postponed, continue WP-0003 with query/extraction only if we commit to
 revisiting diagnostics and contract semantics before generation or LLM hooks.