Initial schemas and validation with extension workplan

2026-05-03 22:12:46 +02:00
parent b96b1fb745
commit 8c9129c371
15 changed files with 1025 additions and 2 deletions


@@ -47,6 +47,17 @@ consumer needs them through the new library contract:
4. Treat old code as reference material; do not preserve backward compatibility unless the new contract explicitly needs it.
5. Keep database, platform, and domain lifecycle concerns out of this repo.
## Practicality Reassessment
The first implementation slices intentionally rebuilt the clean parser and JSON
Schema spine. That is necessary but not sufficient. The legacy project already
showed that heading counts and raw structural schemas have limited practical
utility.
The successor should prioritize a document contract framework before going much
deeper into generic tooling. See `docs/practical-schema-framework-research.md`
and `workplans/MKTT-WP-0004-practical-contract-framework.md`.
## Initial Architecture Target
```text ```text


@@ -0,0 +1,323 @@
# Practical Schema Framework Research
Date: 2026-05-03
## Purpose
This document reassesses `markitect-tool` schema utility before further
implementation. The concern is that pure structural validation, such as heading
counts and min/max depth constraints, is rarely enough to make markdown document
pipelines useful.
The practical opportunity is to define a stronger framework for markdown-native
document contracts: section specifications, content assertions, form fields,
context-aware rules, LLM-assisted assessments, and high-quality diagnostics.
## Research Signals
### Structured Authoring
DITA is the strongest analogue for typed, reusable textual units. It emphasizes
information typing, semantic markup, modularity, reuse, interchange, and
multiple deliverables from one source. A DITA topic is the unit of authoring and
reuse; topics may be generic or specialized into roles such as concept, task, or
reference.
Relevance for `markitect-tool`:
- A markdown document or section should have an explicit information type.
- Information type should imply expected structure and reader purpose.
- Reuse and composition need stable addressing of sections, not only files.
- Specialization is a better mental model than ad hoc schema forks.
Sources:
- https://dita-lang.org/dita/archspec/base/basic-concepts
- https://dita-lang.org/dita/archspec/base/introduction-to-dita
### Document Schemas With Assertions
DocBook remains relevant because it combines formal document schemas with
Schematron-style assertions. That is the missing layer in many simplistic JSON
Schema approaches: grammar says what may exist; assertions say what must be true
in context.
Relevance for `markitect-tool`:
- JSON Schema over `Document.to_dict()` is useful but insufficient.
- We need a second assertion layer for document-specific semantics.
- Diagnostics must point to the document location and rule intention.
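
The grammar/assertion split can be illustrated with a small sketch. This is not the project's API: the `Assertion` dataclass, the `run_assertions` helper, and the dictionary shape (a `headings` list with `level` and `text`) are assumptions chosen to resemble what `Document.to_dict()` might return.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Assertion:
    """A contextual claim about a parsed document (hypothetical model)."""

    rule_id: str
    intention: str
    check: Callable[[dict[str, Any]], bool]


def run_assertions(document: dict[str, Any], assertions: list[Assertion]) -> list[str]:
    """Return one message per failed assertion, naming the rule and its intention."""
    return [f"{a.rule_id}: {a.intention}" for a in assertions if not a.check(document)]


def first_heading_is_title(document: dict[str, Any]) -> bool:
    headings = document.get("headings", [])
    return bool(headings) and headings[0]["level"] == 1


def decision_follows_context(document: dict[str, Any]) -> bool:
    texts = [h["text"] for h in document.get("headings", [])]
    if "Context" not in texts or "Decision" not in texts:
        return False
    return texts.index("Context") < texts.index("Decision")


ASSERTIONS = [
    Assertion("A001", "document starts with a level-1 title", first_heading_is_title),
    Assertion("A002", "Decision must come after Context", decision_follows_context),
]

# A structurally valid document (headings exist) that still violates an assertion.
doc = {
    "headings": [
        {"level": 1, "text": "ADR 7"},
        {"level": 2, "text": "Decision"},
        {"level": 2, "text": "Context"},
    ]
}
failures = run_assertions(doc, ASSERTIONS)
```

A shape-only schema with `minItems: 1` on `headings` would accept this document; the ordering assertion is what catches the problem, and its message carries the rule intention.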
Source:
- https://docbook.org/schemas/docbook/
### Dynamic Form Rules
JSON Schema supports conditional validation through `dependentRequired`,
`dependentSchemas`, and `if`/`then`/`else`. JSON Forms separates data schema
from UI schema and uses rules to show, hide, enable, or disable UI elements
based on JSON Schema conditions. Form.io's architecture treats the form schema
as a single source of truth for validation and conditional logic across client
and server.
Relevance for `markitect-tool`:
- Forms should be first-class, not bolted onto document generation.
- Field definitions need static validation and dynamic rules.
- Prefill, visibility, requiredness, and calculated values should come from the
same contract used for generation and validation.
- Context data must be explicit and typed.
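
As a rough illustration of dynamic requiredness, the sketch below evaluates only the `required` and `dependentRequired` keywords by hand. A real implementation would presumably delegate to a full JSON Schema validator; the field names (`recipient`, `company`, `vat_id`) and the `missing_fields` helper are invented for this example.

```python
from typing import Any

# Hypothetical field contract: "vat_id" becomes required once "company" is present.
FIELD_SCHEMA: dict[str, Any] = {
    "type": "object",
    "required": ["recipient"],
    "dependentRequired": {"company": ["vat_id"]},
}


def missing_fields(data: dict[str, Any], schema: dict[str, Any]) -> list[str]:
    """Evaluate only `required` and `dependentRequired` (a deliberately tiny subset)."""
    missing = [name for name in schema.get("required", []) if name not in data]
    for trigger, dependents in schema.get("dependentRequired", {}).items():
        if trigger in data:
            missing.extend(name for name in dependents if name not in data)
    return missing


private_letter = {"recipient": "A. Reader"}
business_letter = {"recipient": "A. Reader", "company": "Example GmbH"}

private_missing = missing_fields(private_letter, FIELD_SCHEMA)
business_missing = missing_fields(business_letter, FIELD_SCHEMA)
```

Because the same schema drives both the form and the validator, a form layer could mark `vat_id` as required as soon as `company` is filled in, without a second rule definition.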
Sources:
- https://json-schema.org/understanding-json-schema/reference/conditionals
- https://jsonforms.io/docs/uischema/rules/
- https://form.io/features/form-conditional-logic-form-validation/
### LLM-Assisted Assessment
Modern evaluation frameworks treat LLM assessment as explicit graders or
rubrics. OpenAI graders return scores in a 0 to 1 range and can combine grader
types. Promptfoo's `llm-rubric` uses explicit criteria and expects structured
judge output with reason, score, and pass/fail.
Relevance for `markitect-tool`:
- LLM checks should be declared as assessment rules, not hidden in prompts.
- Deterministic validation and LLM assessment should produce one diagnostic
model.
- Section-level rubrics are more useful than whole-document vague grading.
- The LLM provider must remain external; `markitect-tool` defines contracts and
reports.
Sources:
- https://developers.openai.com/api/docs/guides/graders
- https://www.promptfoo.dev/docs/configuration/expected-outputs/model-graded/llm-rubric/
### Markdown Structure
CommonMark gives markdown a well-defined block/inline model. mdast gives a
language-neutral tree vocabulary for Markdown nodes. Both point toward keeping
the parse layer separate from domain/schema layers.
Relevance for `markitect-tool`:
- The core document model should stay close to CommonMark/mdast concepts.
- Practical document contracts should sit above the parse model.
- Section addressing, source spans, and block identity are foundational for good
diagnostics.
Sources:
- https://spec.commonmark.org/0.31.2/
- https://github.com/syntax-tree/mdast
## Terminology Proposal
| Term | Meaning |
| --- | --- |
| Document | A markdown artifact parsed into frontmatter, blocks, headings, sections, and source spans. |
| Section | A heading-led document region with content, children, source location, and stable identity. |
| Document Type | A named contract for a whole document, e.g. ADR, PRD, invoice letter, support reply, concept note. |
| Section Type | A reusable role for a section, e.g. Context, Decision, Risks, Procedure, Evidence, Conclusion. |
| Field | A typed value expected in frontmatter, inline matter, a section, or an external data record. |
| Form | A field collection with UI hints, validation rules, defaults, dynamic visibility, and calculations. |
| Context | External data available during validation/generation, such as user data, project data, dates, or related entities. |
| Rule | A deterministic condition evaluated against document, fields, context, or pipeline state. |
| Assertion | A claim that must hold for content, usually richer than shape validation. |
| Metric Band | A soft or hard target for size/complexity, such as word count, sentence count, section count, or reading level. |
| Assessment | A deterministic or LLM-assisted evaluation that returns pass/fail, score, reason, and diagnostics. |
| Rubric | A human-readable criterion for LLM-assisted assessment, scoped to a document or section type. |
| Diagnostic | A structured finding with severity, code, message, source location, rule id, and suggested repair. |
| Contract | The full specification for a document type: structure, sections, fields, rules, forms, assertions, rubrics, and outputs. |
| Pipeline | A repeatable sequence of parse, prefill, generate, validate, assess, transform, and compose operations. |
## Most Relevant Use Cases
### UC-001: Typed Document Contract
Define a document type such as ADR, PRD, FRS, workplan, customer letter, or
meeting brief. Specify required sections by semantic role, allowed alternatives,
field requirements, and diagnostics.
Practical value:
- Prevents missing critical content.
- Makes generated documents predictable.
- Creates an explicit contract for humans and agents.
Needed tooling:
- `mkt contract check <doc> --contract <contract.md>`
- Section matching by heading text, aliases, ids, or section type markers.
- Diagnostics that say which section/field/assertion failed and why.
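
Section matching by heading text and aliases might look like the following sketch. `SectionSpec` and `match_sections` are hypothetical names, and matching here is limited to case-insensitive, whitespace-normalized heading text; ids and section type markers are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SectionSpec:
    """Hypothetical section specification: a semantic role plus accepted headings."""

    section_type: str
    headings: tuple[str, ...]
    required: bool = True


def _normalize(text: str) -> str:
    return " ".join(text.casefold().split())


def match_sections(
    headings: list[str], specs: list[SectionSpec]
) -> tuple[dict[str, str], list[str]]:
    """Map section types to the heading that satisfied them; list missing required types."""
    available = {_normalize(h): h for h in headings}
    matched: dict[str, str] = {}
    missing: list[str] = []
    for spec in specs:
        hit = next(
            (available[_normalize(a)] for a in spec.headings if _normalize(a) in available),
            None,
        )
        if hit is not None:
            matched[spec.section_type] = hit
        elif spec.required:
            missing.append(spec.section_type)
    return matched, missing


ADR_SPECS = [
    SectionSpec("context", ("Context", "Background")),
    SectionSpec("decision", ("Decision",)),
    SectionSpec("risks", ("Risks", "Consequences"), required=False),
]

matched, missing = match_sections(["Background", "Options", "Consequences"], ADR_SPECS)
```

Reporting the missing section type rather than a missing heading string lets the diagnostic stay meaningful when authors use an accepted alias.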
### UC-002: Section-Level Content Expectations
Specify what a section is expected to contain: assertions, required evidence,
forbidden omissions, content patterns, examples, and reviewer prompts.
Practical value:
- Moves beyond “has a heading” toward “does the section do its job?”
- Enables review of generated or human-authored text.
Needed tooling:
- Deterministic assertions for regex, presence, references, counts, and field
values.
- Optional LLM rubrics for semantic content checks.
- Per-section diagnostic reports.
### UC-003: Size and Complexity Bands
Define soft/hard bands for document and section size: words, characters,
sentences, paragraphs, sections, list items, code blocks, and nesting depth.
Practical value:
- Controls generation output size.
- Keeps templates from becoming bloated or underdeveloped.
- Helps compare intended vs actual document complexity.
Needed tooling:
- Metrics extractor.
- Rule severities: info, warning, error.
- “Too small/too large” diagnostics with actual and target values.
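
A metric band check could be sketched as below. The `MetricBand` fields, the whitespace-based `count_words`, and the message format are assumptions for illustration; a real extractor would count over parsed blocks rather than raw text.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class MetricBand:
    """Hypothetical band: leaving the soft range warns, leaving the hard range fails."""

    metric: str
    soft_min: int
    soft_max: int
    hard_min: int
    hard_max: int


def count_words(text: str) -> int:
    """Whitespace-delimited word count; a real extractor would skip markup."""
    return len(text.split())


def check_band(actual: int, band: MetricBand) -> tuple[str, str] | None:
    """Return (severity, message) when the value leaves the band, else None."""
    if actual < band.hard_min or actual > band.hard_max:
        severity, low, high = "error", band.hard_min, band.hard_max
    elif actual < band.soft_min or actual > band.soft_max:
        severity, low, high = "warning", band.soft_min, band.soft_max
    else:
        return None
    direction = "too small" if actual < low else "too large"
    return severity, f"{band.metric} {direction}: actual {actual}, target {low}-{high}"


band = MetricBand("words", soft_min=5, soft_max=12, hard_min=2, hard_max=20)
ok = check_band(count_words("A decision needs enough words to justify itself."), band)
short = check_band(count_words("Too brief."), band)
empty = check_band(count_words("No."), band)
```

Keeping the soft and hard limits in one band means a single rule yields either a warning or an error, with the actual and target values in the message.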
### UC-004: Form-Backed Markdown Generation
Define forms that collect or prefill structured fields, then render markdown
documents. Fields may be static, calculated, conditional, or context-derived.
Practical value:
- Bridges structured data capture and prose generation.
- Supports repeatable business documents.
- Makes prefill from user/project/entity data explicit.
Needed tooling:
- Field schema.
- UI schema or form hints.
- Dynamic rules for requiredness, visibility, defaults, and calculations.
- Template rendering with validation before and after render.
### UC-005: Context-Aware Validation
Validate a document against external context: user data, project metadata,
related entities, dates, policy constraints, or canonical terminology.
Practical value:
- Checks whether a document is correct for this case, not only generally
well-formed.
- Enables pipelines like personalized letters, compliance reports, and
project-specific workplans.
Needed tooling:
- Context object schema.
- Resolvers for local files, JSON/YAML data, and later higher-layer systems.
- Rule expressions that can reference document and context paths.
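
One possible shape for rule expressions that reference document and context paths is sketched below. The dotted-path syntax, the `resolve` helper, and the `equals`-style rule dictionaries are assumptions, not a committed rule language.

```python
from __future__ import annotations

from typing import Any


def resolve(path: str, scope: dict[str, Any]) -> Any:
    """Follow a dotted path such as 'context.project.name' through nested dicts."""
    current: Any = scope
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            raise KeyError(f"unresolved path: {path}")
        current = current[part]
    return current


def check_equal(rule: dict[str, str], scope: dict[str, Any]) -> str | None:
    """Evaluate a hypothetical equality rule; return a message on mismatch."""
    left = resolve(rule["left"], scope)
    right = resolve(rule["right"], scope)
    if left == right:
        return None
    return f"{rule['id']}: {rule['left']}={left!r} does not match {rule['right']}={right!r}"


# Document data and external context live side by side in one explicit scope.
scope = {
    "document": {"frontmatter": {"project": "markitect", "owner": "alice"}},
    "context": {"project": {"name": "markitect", "owner": "bob"}},
}
rules = [
    {"id": "CTX-001", "left": "document.frontmatter.project", "right": "context.project.name"},
    {"id": "CTX-002", "left": "document.frontmatter.owner", "right": "context.project.owner"},
]
findings = [msg for rule in rules if (msg := check_equal(rule, scope)) is not None]
```

Raising on an unresolved path, instead of treating it as a mismatch, keeps missing context data distinguishable from a genuine rule failure.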
### UC-006: LLM-Assisted Section Assessment
Attach rubrics to section types. Use an external LLM adapter to assess whether a
section satisfies the rubric, returning score, reason, and pass/fail.
Practical value:
- Handles semantic checks that deterministic rules cannot.
- Supports review loops for generated text.
- Makes subjective requirements explicit and auditable.
Needed tooling:
- Rubric declaration format.
- Provider-neutral assessment request/response models.
- Caching and reproducibility metadata.
- Clear distinction between deterministic errors and model-judged findings.
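
Provider-neutral request/response models with a reproducibility key might be sketched as follows. All names here (`AssessmentRequest`, `AssessmentResult`, `cache_key`, `to_result`, the model id string, and the 0.7 threshold) are illustrative assumptions; no LLM is called.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AssessmentRequest:
    section_type: str
    rubric: str
    section_text: str


@dataclass(frozen=True)
class AssessmentResult:
    score: float  # normalized to the 0 to 1 range
    passed: bool
    reason: str
    judged_by_model: bool = True  # separates model findings from deterministic errors


def cache_key(request: AssessmentRequest, model_id: str) -> str:
    """Stable key over the request content and model identity, for reproducibility."""
    payload = json.dumps({"request": asdict(request), "model": model_id}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def to_result(raw: dict, threshold: float = 0.7) -> AssessmentResult:
    """Convert an adapter's raw reply into a result, clamping the score to 0..1."""
    score = min(1.0, max(0.0, float(raw["score"])))
    return AssessmentResult(score=score, passed=score >= threshold, reason=str(raw["reason"]))


request = AssessmentRequest(
    section_type="risks",
    rubric="Names at least one concrete risk and a mitigation.",
    section_text="We may miss the deadline.",
)
key_a = cache_key(request, "example-model-1")
key_b = cache_key(request, "example-model-1")
result = to_result({"score": 0.4, "reason": "Risk named, no mitigation."})
```

Hashing the rubric, section text, and model identity together means a cached verdict is reused only when all three are unchanged.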
### UC-007: Pipeline Diagnostics and Repair Guidance
Run a document pipeline and get one coherent diagnostic report from parsing,
schema checks, field validation, assertions, generation, composition, and
LLM-assisted assessments.
Practical value:
- Makes failures debuggable.
- Helps humans and agents repair documents.
- Avoids scattered errors from unrelated subsystems.
Needed tooling:
- Common diagnostic model.
- Error codes and severities.
- Source spans and rule ids.
- Suggested repair text or structured patches when safe.
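
A common diagnostic model could start as small as the sketch below. The field set follows the terminology table, but the class names, the severity ordering, and the report line format are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    INFO = 1
    WARNING = 2
    ERROR = 3


@dataclass(frozen=True)
class Diagnostic:
    severity: Severity
    code: str
    message: str
    rule_id: str
    line: int | None = None
    suggested_repair: str | None = None


def summarize(diagnostics: list[Diagnostic]) -> tuple[bool, list[str]]:
    """Return (passed, lines); only ERROR findings fail the run."""
    ordered = sorted(diagnostics, key=lambda d: (-d.severity, d.line or 0, d.code))
    lines = [
        f"{d.severity.name.lower()} {d.code} line {d.line if d.line is not None else '?'}: {d.message}"
        for d in ordered
    ]
    passed = all(d.severity < Severity.ERROR for d in diagnostics)
    return passed, lines


# Findings from three different subsystems end up in one report.
diagnostics = [
    Diagnostic(Severity.WARNING, "MET001", "section too short", "metrics.words", line=12),
    Diagnostic(Severity.ERROR, "SEC001", "missing required section 'Decision'", "sections.decision"),
    Diagnostic(Severity.ERROR, "SCH001", "'title' is a required property", "schema", line=1),
]
passed, report = summarize(diagnostics)
```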
## Comparison With markitect-main
`markitect-main` had several useful seeds:
- `x-markitect-sections` for required/recommended/optional/discouraged/improper sections.
- `x-markitect-content-control` for required, discouraged, and forbidden patterns plus word-count metrics.
- Section and content validators with warnings/errors.
- Schema generation and validation experiments.
- Draft generation with `x-markitect-field-mapping`.
- Prompt quality gates with schema and pattern validators.
- Infospace entity parsing and LLM classification/evaluation.
The problem was not lack of ideas. The problem was that the ideas lived in
separate subsystems with different models:
- Schema validation compared generated schemas rather than validating a stable
document contract.
- Semantic validation used `x-markitect-*` extensions but was not integrated
into a unified contract framework.
- Field mapping existed in draft generation, not in a general form/context
model.
- LLM quality gates existed inside prompt execution, not as provider-neutral
document assessments.
- Infospace checks were domain/application layer behavior, not syntax-layer
primitives.
## Strategic Direction
The successor should introduce a framework layer above parsing:
```text
Markdown parse model
-> document contract
-> section specifications
-> field/form specifications
-> deterministic rules/assertions
-> metric bands
-> optional LLM rubrics
-> unified diagnostics
```
This should not replace JSON Schema. JSON Schema remains useful for typed data
and machine validation. The new layer should make document-specific semantics
natural.
## Recommendation
Do not continue straight into generic query/transform work until this framework
direction is captured. The next implementation slice should be a small,
deterministic version of document contracts:
1. Define the contract schema and terminology.
2. Implement section specifications.
3. Implement metric bands.
4. Implement the unified diagnostic model.
5. Leave LLM rubrics and form dynamics as designed extension points for the next
slice.
This is the utility inflection point. It will make `markitect-tool` practically
useful instead of merely structurally correct.


@@ -34,7 +34,7 @@ workplans/
SBOM source: `sbom-tools.yaml`.
-Initial SBOM ingest succeeded on 2026-05-03 with seven declared entries for the
+Initial SBOM ingest succeeded on 2026-05-03 with eight declared entries for the
core and optional dependencies.
## Registered Extension Points


@@ -11,6 +11,7 @@ requires-python = ">=3.12"
license = { text = "MIT" }
dependencies = [
    "click>=8.0",
    "jsonschema>=4.0",
    "markdown-it-py",
    "PyYAML",
]


@@ -7,6 +7,10 @@ tools:
    ecosystem: python
    is_direct: true
    is_dev: false
  - name: jsonschema
    ecosystem: python
    is_direct: true
    is_dev: false
  - name: PyYAML
    ecosystem: python
    is_direct: true


@@ -9,6 +9,14 @@ from markitect_tool.core import (
    parse_markdown,
    parse_markdown_file,
)
from markitect_tool.schema import (
    MarkdownSchema,
    SchemaValidationResult,
    ValidationViolation,
    load_schema_file,
    validate_document,
    validate_markdown_file,
)

__all__ = [
    "ContentBlock",
@@ -18,4 +26,10 @@ __all__ = [
    "Section",
    "parse_markdown",
    "parse_markdown_file",
    "MarkdownSchema",
    "SchemaValidationResult",
    "ValidationViolation",
    "load_schema_file",
    "validate_document",
    "validate_markdown_file",
]


@@ -9,6 +9,7 @@ import click
import yaml

from markitect_tool.core import parse_markdown_file
from markitect_tool.schema import load_schema_file, validate_markdown_file, validate_schema


@click.group()
@@ -40,5 +41,66 @@ def parse(file: Path, output_format: str) -> None:
        click.echo(json.dumps(data, indent=2, ensure_ascii=False))


@main.command()
@click.argument("file", type=click.Path(exists=True, dir_okay=False, path_type=Path))
@click.option(
    "--schema",
    "schema_file",
    required=True,
    type=click.Path(exists=True, dir_okay=False, path_type=Path),
)
@click.option(
    "--format",
    "output_format",
    type=click.Choice(["json", "yaml", "text"], case_sensitive=False),
    default="text",
    show_default=True,
)
def validate(file: Path, schema_file: Path, output_format: str) -> None:
    """Validate a Markdown file against a Markdown schema file."""
    result = validate_markdown_file(file, schema_file)
    _emit_result(result.to_dict(), output_format)
    raise click.exceptions.Exit(0 if result.valid else 1)


@main.group()
def schema() -> None:
    """Work with Markdown schema files."""


@schema.command("validate")
@click.argument("schema_file", type=click.Path(exists=True, dir_okay=False, path_type=Path))
@click.option(
    "--format",
    "output_format",
    type=click.Choice(["json", "yaml", "text"], case_sensitive=False),
    default="text",
    show_default=True,
)
def schema_validate(schema_file: Path, output_format: str) -> None:
    """Validate that a Markdown schema contains a well-formed JSON Schema."""
    loaded = load_schema_file(schema_file)
    result = validate_schema(loaded.schema)
    data = result.to_dict() | {"schema_path": str(schema_file)}
    _emit_result(data, output_format)
    raise click.exceptions.Exit(0 if result.valid else 1)


def _emit_result(data: dict, output_format: str) -> None:
    if output_format == "json":
        click.echo(json.dumps(data, indent=2, ensure_ascii=False))
    elif output_format == "yaml":
        click.echo(yaml.safe_dump(data, sort_keys=False))
    else:
        if data.get("valid"):
            click.echo("valid")
        else:
            click.echo("invalid")
            for violation in data.get("violations", []):
                click.echo(f"- {violation['path']}: {violation['message']}")


if __name__ == "__main__":
    main()


@@ -0,0 +1,31 @@
"""Schema loading and validation for structured Markdown documents."""

from markitect_tool.schema.loader import (
    InvalidSchemaFormatError,
    MarkdownSchema,
    SchemaLoaderError,
    SchemaNotFoundError,
    load_schema_file,
    load_schema_text,
)
from markitect_tool.schema.validator import (
    SchemaValidationResult,
    ValidationViolation,
    validate_document,
    validate_markdown_file,
    validate_schema,
)

__all__ = [
    "InvalidSchemaFormatError",
    "MarkdownSchema",
    "SchemaLoaderError",
    "SchemaNotFoundError",
    "SchemaValidationResult",
    "ValidationViolation",
    "load_schema_file",
    "load_schema_text",
    "validate_document",
    "validate_markdown_file",
    "validate_schema",
]


@@ -0,0 +1,124 @@
"""Load JSON Schema definitions embedded in Markdown schema files."""

from __future__ import annotations

import json
import re
from dataclasses import dataclass
from pathlib import Path
from typing import Any

import yaml


class SchemaLoaderError(ValueError):
    """Base error raised for schema loading failures."""


class SchemaNotFoundError(SchemaLoaderError):
    """Raised when no JSON schema block can be found."""


class InvalidSchemaFormatError(SchemaLoaderError):
    """Raised when a schema block exists but is not valid JSON object data."""


@dataclass(frozen=True)
class MarkdownSchema:
    """A JSON Schema loaded from a Markdown schema document."""

    schema: dict[str, Any]
    metadata: dict[str, Any]
    documentation: str
    source_path: str | None = None

    def to_dict(self) -> dict[str, Any]:
        data = {
            "schema": self.schema,
            "metadata": self.metadata,
            "documentation": self.documentation,
            "source_path": self.source_path,
        }
        return {key: value for key, value in data.items() if value is not None}


_JSON_BLOCK_RE = re.compile(r"```json\s*(.*?)```", re.DOTALL | re.IGNORECASE)


def load_schema_file(path: str | Path) -> MarkdownSchema:
    """Load a Markdown schema file."""
    schema_path = Path(path)
    if not schema_path.exists():
        raise FileNotFoundError(f"Schema file not found: {schema_path}")
    return load_schema_text(schema_path.read_text(encoding="utf-8"), source_path=str(schema_path))


def load_schema_text(text: str, source_path: str | None = None) -> MarkdownSchema:
    """Load a Markdown schema document from text."""
    metadata, documentation = _split_frontmatter(text)
    schema = _extract_json_schema(documentation)
    schema = dict(schema)
    schema.setdefault(
        "x-markitect-source",
        {
            "format": "markdown",
            "file": source_path,
            "frontmatter": metadata,
        },
    )
    return MarkdownSchema(
        schema=schema,
        metadata=metadata,
        documentation=documentation,
        source_path=source_path,
    )


def _split_frontmatter(text: str) -> tuple[dict[str, Any], str]:
    if not text.startswith("---\n"):
        return {}, text
    end = text.find("\n---", 4)
    if end == -1:
        return {}, text
    closing_end = text.find("\n", end + 4)
    if closing_end == -1:
        closing_end = len(text)
    else:
        closing_end += 1
    raw = text[4:end]
    try:
        metadata = yaml.safe_load(raw) if raw.strip() else {}
    except yaml.YAMLError as exc:
        raise InvalidSchemaFormatError(f"Invalid schema frontmatter: {exc}") from exc
    if metadata is None:
        metadata = {}
    if not isinstance(metadata, dict):
        raise InvalidSchemaFormatError("Schema frontmatter must be a mapping")
    return metadata, text[closing_end:]


def _extract_json_schema(text: str) -> dict[str, Any]:
    candidates = list(_JSON_BLOCK_RE.finditer(text))
    if not candidates:
        raise SchemaNotFoundError("No JSON schema found in markdown schema")
    parsed_blocks: list[dict[str, Any]] = []
    for match in candidates:
        raw_json = match.group(1).strip()
        try:
            data = json.loads(raw_json)
        except json.JSONDecodeError as exc:
            raise InvalidSchemaFormatError(f"Invalid JSON schema block: {exc}") from exc
        if not isinstance(data, dict):
            raise InvalidSchemaFormatError("JSON schema block must contain an object")
        parsed_blocks.append(data)
    for data in parsed_blocks:
        if "$schema" in data or "type" in data:
            return data
    return parsed_blocks[0]
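
The block-selection rule in `_extract_json_schema` can be tried in isolation. The sketch below is a stdlib-only approximation, not the module itself: `pick_schema_block`, `JSON_BLOCK_RE`, and `FENCE` are illustrative names, error handling is reduced to a plain `ValueError`, and the fence string is built programmatically only so the example can sit inside a fenced block.

```python
import json
import re

FENCE = "`" * 3  # three backticks, assembled so this example stays fence-safe
JSON_BLOCK_RE = re.compile(FENCE + r"json\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)


def pick_schema_block(text: str) -> dict:
    """Parse every fenced JSON block; prefer the first that declares $schema or type."""
    blocks = [json.loads(m.group(1).strip()) for m in JSON_BLOCK_RE.finditer(text)]
    if not blocks:
        raise ValueError("no JSON block found")
    for block in blocks:
        if "$schema" in block or "type" in block:
            return block
    return blocks[0]


# Two fenced blocks: an illustrative payload first, the actual schema second.
markdown = "\n".join(
    [
        "# Example Schema",
        FENCE + "json",
        '{"example": "an illustrative payload, not a schema"}',
        FENCE,
        FENCE + "JSON",
        '{"type": "object", "required": ["headings"]}',
        FENCE,
    ]
)
chosen = pick_schema_block(markdown)
```

The preference for blocks carrying `$schema` or `type` lets a schema document show example payloads before the schema definition without the loader picking the wrong block.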


@@ -0,0 +1,110 @@
"""Validate parsed Markdown documents against JSON Schema."""

from __future__ import annotations

from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Any

from jsonschema import Draft202012Validator, SchemaError, ValidationError

from markitect_tool.core import Document, parse_markdown_file
from markitect_tool.schema.loader import MarkdownSchema, load_schema_file


@dataclass(frozen=True)
class ValidationViolation:
    """A single schema validation violation."""

    path: str
    message: str
    schema_path: str

    def to_dict(self) -> dict[str, str]:
        return asdict(self)


@dataclass(frozen=True)
class SchemaValidationResult:
    """Validation result for one document and one schema."""

    valid: bool
    violations: list[ValidationViolation]
    document_path: str | None = None
    schema_path: str | None = None

    def to_dict(self) -> dict[str, Any]:
        data = {
            "valid": self.valid,
            "violations": [violation.to_dict() for violation in self.violations],
            "document_path": self.document_path,
            "schema_path": self.schema_path,
        }
        return {key: value for key, value in data.items() if value is not None}


def validate_schema(schema: dict[str, Any]) -> SchemaValidationResult:
    """Validate that a JSON Schema itself is well formed."""
    try:
        Draft202012Validator.check_schema(schema)
    except SchemaError as exc:
        return SchemaValidationResult(
            valid=False,
            violations=[
                ValidationViolation(
                    path=_format_path(exc.path),
                    message=exc.message,
                    schema_path=_format_path(exc.schema_path),
                )
            ],
        )
    return SchemaValidationResult(valid=True, violations=[])


def validate_markdown_file(
    markdown_path: str | Path, schema_path: str | Path
) -> SchemaValidationResult:
    """Parse and validate a Markdown file against a Markdown schema file."""
    document = parse_markdown_file(markdown_path)
    loaded_schema = load_schema_file(schema_path)
    return validate_document(document, loaded_schema)


def validate_document(
    document: Document, schema: MarkdownSchema | dict[str, Any]
) -> SchemaValidationResult:
    """Validate a parsed document against a loaded or raw JSON Schema."""
    raw_schema = schema.schema if isinstance(schema, MarkdownSchema) else schema
    schema_path = schema.source_path if isinstance(schema, MarkdownSchema) else None
    schema_check = validate_schema(raw_schema)
    if not schema_check.valid:
        return SchemaValidationResult(
            valid=False,
            violations=schema_check.violations,
            document_path=document.source_path,
            schema_path=schema_path,
        )
    validator = Draft202012Validator(raw_schema)
    violations = [
        ValidationViolation(
            path=_format_path(error.path),
            message=error.message,
            schema_path=_format_path(error.schema_path),
        )
        for error in sorted(validator.iter_errors(document.to_dict()), key=str)
    ]
    return SchemaValidationResult(
        valid=not violations,
        violations=violations,
        document_path=document.source_path,
        schema_path=schema_path,
    )


def _format_path(path: Any) -> str:
    parts = [str(part) for part in path]
    return "$" if not parts else "$." + ".".join(parts)
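
To make the violation path format concrete, here is a standalone copy of the formatting rule applied to deques, which is the container `jsonschema` uses for error paths. `format_path` is a renamed copy for illustration and does not import the module.

```python
from collections import deque
from typing import Iterable


def format_path(path: Iterable[object]) -> str:
    """Render path parts as '$.a.0.b'; an empty path is the document root '$'."""
    parts = [str(part) for part in path]
    return "$" if not parts else "$." + ".".join(parts)


root = format_path(deque())
nested = format_path(deque(["headings", 0, "level"]))
```

Array indices are joined with dots rather than brackets, so the result resembles JSONPath without being strictly valid JSONPath; consumers that need to resolve paths programmatically may want the raw parts as well.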


@@ -0,0 +1,19 @@
---
version: "1.0.0"
---
# Simple Document Schema
```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "required": ["headings"],
  "properties": {
    "headings": {
      "type": "array",
      "minItems": 1
    }
  }
}
```

tests/fixtures/valid-document.md vendored Normal file

@@ -0,0 +1,3 @@
# Hello
World.


@@ -0,0 +1,164 @@
from pathlib import Path

from click.testing import CliRunner

from markitect_tool.cli import main
from markitect_tool.schema import (
    InvalidSchemaFormatError,
    SchemaNotFoundError,
    load_schema_file,
    validate_markdown_file,
    validate_schema,
)

SCHEMA_TEXT = """---
schema-id: "https://example.test/schemas/document/v1"
version: "1.0.0"
status: "stable"
---
# Document Schema
## Schema Definition
```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "Document Schema",
  "type": "object",
  "required": ["frontmatter", "headings"],
  "properties": {
    "frontmatter": {
      "type": "object",
      "required": ["title"],
      "properties": {
        "title": {"type": "string"}
      }
    },
    "headings": {
      "type": "array",
      "minItems": 1,
      "items": {
        "type": "object",
        "required": ["level", "text"],
        "properties": {
          "level": {"type": "integer"},
          "text": {"type": "string"}
        }
      }
    }
  }
}
```
"""


def test_load_schema_file_extracts_metadata_and_json_schema(tmp_path: Path):
    schema_file = tmp_path / "document-schema.md"
    schema_file.write_text(SCHEMA_TEXT, encoding="utf-8")
    loaded = load_schema_file(schema_file)
    assert loaded.metadata["schema-id"] == "https://example.test/schemas/document/v1"
    assert loaded.metadata["status"] == "stable"
    assert loaded.schema["title"] == "Document Schema"
    assert loaded.schema["x-markitect-source"]["format"] == "markdown"
    assert loaded.source_path == str(schema_file)


def test_load_schema_file_requires_json_block(tmp_path: Path):
    schema_file = tmp_path / "missing.md"
    schema_file.write_text("# Missing\n\nNo schema.", encoding="utf-8")
    try:
        load_schema_file(schema_file)
    except SchemaNotFoundError as exc:
        assert "No JSON schema found" in str(exc)
    else:
        raise AssertionError("expected SchemaNotFoundError")


def test_load_schema_file_rejects_invalid_json(tmp_path: Path):
    schema_file = tmp_path / "invalid.md"
    schema_file.write_text("```json\n{invalid json}\n```", encoding="utf-8")
    try:
        load_schema_file(schema_file)
    except InvalidSchemaFormatError as exc:
        assert "Invalid JSON schema block" in str(exc)
    else:
        raise AssertionError("expected InvalidSchemaFormatError")


def test_validate_markdown_file_returns_valid_result(tmp_path: Path):
    schema_file = tmp_path / "document-schema.md"
    schema_file.write_text(SCHEMA_TEXT, encoding="utf-8")
    markdown_file = tmp_path / "document.md"
    markdown_file.write_text("---\ntitle: Example\n---\n\n# Example\n\nBody.", encoding="utf-8")
    result = validate_markdown_file(markdown_file, schema_file)
    assert result.valid is True
    assert result.violations == []
    assert result.document_path == str(markdown_file)
    assert result.schema_path == str(schema_file)


def test_validate_markdown_file_reports_violations(tmp_path: Path):
    schema_file = tmp_path / "document-schema.md"
    schema_file.write_text(SCHEMA_TEXT, encoding="utf-8")
    markdown_file = tmp_path / "document.md"
    markdown_file.write_text("# Missing Title\n\nBody.", encoding="utf-8")
    result = validate_markdown_file(markdown_file, schema_file)
    assert result.valid is False
    assert result.violations
    assert result.violations[0].path == "$.frontmatter"
    assert "title" in result.violations[0].message


def test_validate_schema_reports_invalid_schema():
    result = validate_schema({"type": 7})
    assert result.valid is False
    assert result.violations


def test_mkt_validate_exits_zero_for_valid_document(tmp_path: Path):
    schema_file = tmp_path / "document-schema.md"
    schema_file.write_text(SCHEMA_TEXT, encoding="utf-8")
    markdown_file = tmp_path / "document.md"
    markdown_file.write_text("---\ntitle: Example\n---\n\n# Example\n", encoding="utf-8")
    result = CliRunner().invoke(
        main, ["validate", str(markdown_file), "--schema", str(schema_file)]
    )
    assert result.exit_code == 0
    assert "valid" in result.output


def test_mkt_validate_exits_nonzero_for_invalid_document(tmp_path: Path):
    schema_file = tmp_path / "document-schema.md"
    schema_file.write_text(SCHEMA_TEXT, encoding="utf-8")
    markdown_file = tmp_path / "document.md"
    markdown_file.write_text("# Missing Title\n", encoding="utf-8")
    result = CliRunner().invoke(
        main, ["validate", str(markdown_file), "--schema", str(schema_file)]
    )
    assert result.exit_code == 1
    assert "invalid" in result.output


def test_mkt_schema_validate(tmp_path: Path):
    schema_file = tmp_path / "document-schema.md"
    schema_file.write_text(SCHEMA_TEXT, encoding="utf-8")
    result = CliRunner().invoke(main, ["schema", "validate", str(schema_file)])
    assert result.exit_code == 0
    assert "valid" in result.output


@@ -52,7 +52,7 @@ sections, content blocks, parser tokens, API access, and `mkt parse`.
```task
id: MKTT-WP-0003-T003
-status: todo
+status: done
priority: high
state_hub_task_id: "36a22def-d415-4c08-a793-836ee52e4308"
```
@@ -60,6 +60,9 @@ state_hub_task_id: "36a22def-d415-4c08-a793-836ee52e4308"
Implement FR-010 through FR-012: define/derive schemas, validate documents,
and report structured violations with file/location context.
Initial implementation complete for Markdown schema loading, JSON Schema
validation, structured violations, `mkt validate`, and `mkt schema validate`.
## P3.4 - Implement query and extraction
```task ```task


@@ -0,0 +1,154 @@
---
id: MKTT-WP-0004
type: workplan
title: "Practical Document Contract Framework"
domain: markitect
status: proposed
owner: markitect-tool
topic_slug: markitect
created: "2026-05-03"
updated: "2026-05-03"
---
# MKTT-WP-0004: Practical Document Contract Framework
## Purpose
Improve the practical utility of `markitect-tool` by moving beyond generic
heading-count schema validation toward document contracts with section
specifications, fields/forms, context-aware rules, metric bands, optional LLM
assessments, and unified diagnostics.
## Background
Research and legacy comparison are captured in:
- `docs/practical-schema-framework-research.md`
- `docs/markitect-main-scope-assessment.md`
- `docs/markitect-main-test-migration-inventory.md`
## P4.1 - Define contract terminology and file format
```task
id: MKTT-WP-0004-T001
status: todo
priority: high
```
Define the first `DocumentContract` format in markdown/YAML:
- document type
- section specifications
- field/form specifications
- deterministic rules/assertions
- metric bands
- optional assessment rubrics
- diagnostic metadata
Keep it provider-neutral and readable by humans.
## P4.2 - Implement unified diagnostic model
```task
id: MKTT-WP-0004-T002
status: todo
priority: high
```
Create diagnostics with severity, code, message, source location, contract
location, rule id, and optional repair guidance. Use this model for JSON Schema
violations and all new contract checks.
## P4.3 - Implement section specifications
```task
id: MKTT-WP-0004-T003
status: todo
priority: high
```
Support required, recommended, optional, discouraged, and forbidden sections.
Support aliases, expected heading level, section type, ordering constraints,
and clear diagnostics.
## P4.4 - Implement metric bands
```task
id: MKTT-WP-0004-T004
status: todo
priority: medium
```
Support document-level and section-level bands for words, characters,
sentences, paragraphs, sections, list items, code blocks, and nesting depth.
Allow soft warnings and hard errors.
## P4.5 - Design form and context model
```task
id: MKTT-WP-0004-T005
status: todo
priority: medium
```
Specify fields, defaults, prefill sources, dynamic requiredness, conditional
visibility, calculations, and validation against external context. This task is
design-first; implementation can follow in a later workplan.
## P4.6 - Design LLM assessment adapter contract
```task
id: MKTT-WP-0004-T006
status: todo
priority: medium
```
Define provider-neutral request/response models for section-level rubrics:
criteria, inputs, context, score, pass/fail, reason, model metadata, and cache
keys. Do not bind core logic to any provider.
## P4.7 - Add practical CLI surface
```task
id: MKTT-WP-0004-T007
status: todo
priority: high
```
Add:
```text
mkt contract validate <contract.md>
mkt contract check <document.md> --contract <contract.md>
mkt metrics <document.md>
```
Ensure output is useful to humans and machines.
## P4.8 - Build use-case examples
```task
id: MKTT-WP-0004-T008
status: todo
priority: medium
```
Create examples for:
- ADR
- PRD/FRS
- workplan
- personalized/business letter
- concept note or entity profile
Each example should include contract, valid document, invalid document, and
expected diagnostics.
## Decision Point
This workplan should probably run before WP-0003 query/transform/cache work,
because it changes what "validation" means and establishes the diagnostic model
that later query/transform/generation features should reuse.
If postponed, continue WP-0003 with query/extraction only if we commit to
revisiting diagnostics and contract semantics before generation or LLM hooks.