Practical Schema Framework Research
Date: 2026-05-03
Purpose
This document reassesses markitect-tool schema utility before further
implementation. The concern is that pure structural validation, such as heading
counts and min/max depth constraints, is rarely enough to make markdown document
pipelines useful.
The practical opportunity is to define a stronger framework for markdown-native document contracts: section specifications, content assertions, form fields, context-aware rules, LLM-assisted assessments, and high-quality diagnostics.
Research Signals
Structured Authoring
DITA is the strongest analogue for typed, reusable textual units. It emphasizes information typing, semantic markup, modularity, reuse, interchange, and multiple deliverables from one source. A DITA topic is the unit of authoring and reuse; topics may be generic or specialized into roles such as concept, task, or reference.
Relevance for markitect-tool:
- A markdown document or section should have an explicit information type.
- Information type should imply expected structure and reader purpose.
- Reuse and composition need stable addressing of sections, not only files.
- Specialization is a better mental model than ad hoc schema forks.
Sources:
- https://dita-lang.org/dita/archspec/base/basic-concepts
- https://dita-lang.org/dita/archspec/base/introduction-to-dita
Document Schemas With Assertions
DocBook remains relevant because it combines formal document schemas with Schematron-style assertions. That is the missing layer in many simplistic JSON Schema approaches: grammar says what may exist; assertions say what must be true in context.
Relevance for markitect-tool:
- JSON Schema over `Document.to_dict()` is useful but insufficient.
- We need a second assertion layer for document-specific semantics.
- Diagnostics must point to the document location and rule intention.
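The two layers can be sketched in a few lines of plain Python. This is only an illustration: the dictionary shape assumed for `Document.to_dict()`, the `Finding` class, and the rule id are hypothetical, and the shape check stands in for a real JSON Schema validator.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    rule_id: str    # which rule produced the finding
    location: str   # where in the document it applies
    message: str    # the rule's intention, in plain words


def check_shape(doc: dict) -> list[Finding]:
    """Grammar layer: only says which top-level keys must exist."""
    findings = []
    for key in ("frontmatter", "sections"):
        if key not in doc:
            findings.append(Finding("shape.required", f"/{key}", f"missing '{key}'"))
    return findings


def assert_accepted_has_decision(doc: dict) -> list[Finding]:
    """Assertion layer: a contextual rule layered on top of the shape check."""
    status = doc.get("frontmatter", {}).get("status")
    titles = [s.get("title") for s in doc.get("sections", [])]
    if status == "accepted" and "Decision" not in titles:
        return [Finding("adr.accepted-needs-decision", "/sections",
                        "an accepted ADR must contain a Decision section")]
    return []


# Assumed output shape of Document.to_dict(); the real shape may differ.
doc = {"frontmatter": {"status": "accepted"},
       "sections": [{"title": "Context"}]}
findings = check_shape(doc) + assert_accepted_has_decision(doc)
for f in findings:
    print(f"{f.rule_id} at {f.location}: {f.message}")
```

The document passes the shape check yet fails the assertion, which is the gap the second layer is meant to close.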
Source:
Dynamic Form Rules
JSON Schema supports conditional validation through dependentRequired,
dependentSchemas, and if/then/else. JSON Forms separates data schema
from UI schema and uses rules to show, hide, enable, or disable UI elements
based on JSON Schema conditions. Form.io’s architecture treats the form schema
as a single source of truth for validation and conditional logic across client
and server.
Relevance for markitect-tool:
- Forms should be first-class, not bolted onto document generation.
- Field definitions need static validation and dynamic rules.
- Prefill, visibility, requiredness, and calculated values should come from the same contract used for generation and validation.
- Context data must be explicit and typed.
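A minimal sketch of one contract driving requiredness, visibility, and calculated values might look like the following. `FieldSpec`, `evaluate`, and the field names are assumptions made for illustration; they are not an existing markitect-tool API, and the callables stand in for whatever rule expression format is eventually chosen.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

Values = dict[str, Any]


@dataclass
class FieldSpec:
    name: str
    required_when: Optional[Callable[[Values], bool]] = None  # dynamic requiredness
    visible_when: Optional[Callable[[Values], bool]] = None   # dynamic visibility
    calculate: Optional[Callable[[Values], Any]] = None       # calculated value


def evaluate(fields: list[FieldSpec], values: Values) -> dict[str, Any]:
    """Apply calculations, then report visible fields and missing required ones."""
    resolved = dict(values)
    for f in fields:
        if f.calculate is not None:
            resolved[f.name] = f.calculate(resolved)
    visible = [f.name for f in fields
               if f.visible_when is None or f.visible_when(resolved)]
    missing = [f.name for f in fields
               if f.required_when is not None and f.required_when(resolved)
               and resolved.get(f.name) in (None, "")]
    return {"values": resolved, "visible": visible, "missing": missing}


# Hypothetical invoice form: vat_id is shown and required only for one country.
fields = [
    FieldSpec("country"),
    FieldSpec("vat_id",
              required_when=lambda v: v.get("country") == "DE",
              visible_when=lambda v: v.get("country") == "DE"),
    FieldSpec("quantity"),
    FieldSpec("unit_price"),
    FieldSpec("total", calculate=lambda v: v["quantity"] * v["unit_price"]),
]
result = evaluate(fields, {"country": "DE", "quantity": 3, "unit_price": 40})
print(result["missing"], result["values"]["total"])
```

Because the same `fields` list feeds both the form and the validator, a UI and a batch pipeline would agree on when `vat_id` is required.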
Sources:
- https://json-schema.org/understanding-json-schema/reference/conditionals
- https://jsonforms.io/docs/uischema/rules/
- https://form.io/features/form-conditional-logic-form-validation/
LLM-Assisted Assessment
Modern evaluation frameworks treat LLM assessment as explicit graders or
rubrics. OpenAI graders return scores in a 0–1 range and can combine grader
types. Promptfoo’s llm-rubric uses explicit criteria and expects structured
judge output with reason, score, and pass/fail.
Relevance for markitect-tool:
- LLM checks should be declared as assessment rules, not hidden in prompts.
- Deterministic validation and LLM assessment should produce one diagnostic model.
- Section-level rubrics are more useful than whole-document vague grading.
- The LLM provider must remain external; `markitect-tool` defines contracts and reports.
Sources:
- https://developers.openai.com/api/docs/guides/graders
- https://www.promptfoo.dev/docs/configuration/expected-outputs/model-graded/llm-rubric/
Markdown Structure
CommonMark gives markdown a well-defined block/inline model. mdast gives a language-neutral tree vocabulary for Markdown nodes. Both point toward keeping the parse layer separate from domain/schema layers.
Relevance for markitect-tool:
- The core document model should stay close to CommonMark/mdast concepts.
- Practical document contracts should sit above the parse model.
- Section addressing, source spans, and block identity are foundational for good diagnostics.
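To make section addressing and source spans concrete, here is a rough sketch that indexes ATX headings into sections with a path-style address and a line span. It is deliberately not a CommonMark parser: it ignores setext headings and does not skip fenced code blocks. The `Section` class and the slug scheme are assumptions.

```python
import re
from dataclasses import dataclass

HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*$")


@dataclass
class Section:
    path: str        # address built from ancestor heading slugs
    level: int
    start_line: int  # 1-based line of the heading
    end_line: int    # last line belonging to the section (inclusive)


def slug(text: str) -> str:
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def index_sections(markdown: str) -> list[Section]:
    lines = markdown.splitlines()
    sections: list[Section] = []
    stack: list[Section] = []  # currently open sections, outermost first
    for number, line in enumerate(lines, start=1):
        match = HEADING.match(line)
        if not match:
            continue
        level = len(match.group(1))
        # A heading closes every open section at the same or a deeper level.
        while stack and stack[-1].level >= level:
            stack.pop().end_line = number - 1
        parent = stack[-1].path + "/" if stack else ""
        section = Section(parent + slug(match.group(2)), level, number, len(lines))
        sections.append(section)
        stack.append(section)
    return sections


sample = "# ADR\n\n## Context\ntext\n\n## Decision\nmore\n"
for s in index_sections(sample):
    print(s.path, s.start_line, s.end_line)
```

A diagnostic can then cite `adr/context` and its line range instead of a bare file name. Slug-based paths change when a heading is renamed, so a real implementation would probably also accept explicit ids.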
Sources:
Terminology Proposal
| Term | Meaning |
|---|---|
| Document | A markdown artifact parsed into frontmatter, blocks, headings, sections, and source spans. |
| Section | A heading-led document region with content, children, source location, and stable identity. |
| Document Type | A named contract for a whole document, e.g. ADR, PRD, invoice letter, support reply, concept note. |
| Section Type | A reusable role for a section, e.g. Context, Decision, Risks, Procedure, Evidence, Conclusion. |
| Field | A typed value expected in frontmatter, inline matter, a section, or an external data record. |
| Form | A field collection with UI hints, validation rules, defaults, dynamic visibility, and calculations. |
| Context | External data available during validation/generation, such as user data, project data, dates, or related entities. |
| Rule | A deterministic condition evaluated against document, fields, context, or pipeline state. |
| Assertion | A claim that must hold for content, usually richer than shape validation. |
| Metric Band | A soft or hard target for size/complexity, such as word count, sentence count, section count, or reading level. |
| Assessment | A deterministic or LLM-assisted evaluation that returns pass/fail, score, reason, and diagnostics. |
| Rubric | A human-readable criterion for LLM-assisted assessment, scoped to a document or section type. |
| Diagnostic | A structured finding with severity, code, message, source location, rule id, and suggested repair. |
| Contract | The full specification for a document type: structure, sections, fields, rules, forms, assertions, rubrics, and outputs. |
| Pipeline | A repeatable sequence of parse, prefill, generate, validate, assess, transform, and compose operations. |
Most Relevant Use Cases
UC-001: Typed Document Contract
Define a document type such as ADR, PRD, FRS, workplan, customer letter, or meeting brief. Specify required sections by semantic role, allowed alternatives, field requirements, and diagnostics.
Practical value:
- Prevents missing critical content.
- Makes generated documents predictable.
- Creates an explicit contract for humans and agents.
Needed tooling:
- `mkt contract check <doc> --contract <contract.md>`
- Section matching by heading text, aliases, ids, or section type markers.
- Diagnostics that say which section/field/assertion failed and why.
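Section matching by heading text and aliases could be sketched as below. `SectionSpec` and `match_sections` are hypothetical names, and matching by ids or section type markers is left out to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class SectionSpec:
    section_type: str
    aliases: list[str] = field(default_factory=list)
    required: bool = True


def normalize(heading: str) -> str:
    # Case-insensitive comparison with collapsed whitespace.
    return " ".join(heading.lower().split())


def match_sections(specs: list[SectionSpec],
                   headings: list[str]) -> tuple[dict[str, str], list[str]]:
    """Return (section_type -> matched heading, diagnostic messages)."""
    by_name = {normalize(h): h for h in headings}
    matched: dict[str, str] = {}
    problems: list[str] = []
    for spec in specs:
        candidates = [spec.section_type, *spec.aliases]
        hit = next((by_name[normalize(c)] for c in candidates
                    if normalize(c) in by_name), None)
        if hit is not None:
            matched[spec.section_type] = hit
        elif spec.required:
            problems.append(f"missing required section '{spec.section_type}' "
                            f"(accepted headings: {', '.join(candidates)})")
    return matched, problems


adr_specs = [
    SectionSpec("Context", aliases=["Background"]),
    SectionSpec("Decision"),
    SectionSpec("Risks", required=False),
]
matched, problems = match_sections(adr_specs, ["Background", "Consequences"])
print(matched)
print(problems)
```

The diagnostic names both the failed section type and the headings that would have satisfied it, which gives an author or agent enough to repair the document.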
UC-002: Section-Level Content Expectations
Specify what a section is expected to contain: assertions, required evidence, forbidden omissions, content patterns, examples, and reviewer prompts.
Practical value:
- Moves beyond “has a heading” toward “does the section do its job?”
- Enables review of generated or human-authored text.
Needed tooling:
- Deterministic assertions for regex, presence, references, counts, and field values.
- Optional LLM rubrics for semantic content checks.
- Per-section diagnostic reports.
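As a sketch of the deterministic part, the following attaches regex, count, and presence assertions to a single section. The `Assertion` class, the helper names, and the rule ids are illustrative assumptions.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Assertion:
    rule_id: str
    message: str
    check: Callable[[str], bool]  # True when the section text satisfies the rule


def matches(pattern: str) -> Callable[[str], bool]:
    return lambda text: re.search(pattern, text, re.IGNORECASE) is not None


def at_least(pattern: str, count: int) -> Callable[[str], bool]:
    return lambda text: len(re.findall(pattern, text, re.MULTILINE)) >= count


def run_assertions(section_text: str,
                   assertions: list[Assertion]) -> list[tuple[str, str]]:
    """Return (rule_id, message) for every assertion that does not hold."""
    return [(a.rule_id, a.message) for a in assertions if not a.check(section_text)]


# Hypothetical expectations for a "Risks" section type.
risks_assertions = [
    Assertion("risks.has-mitigation", "mention at least one mitigation",
              matches(r"\bmitigat")),
    Assertion("risks.min-items", "list at least two risks",
              at_least(r"^- ", 2)),
    Assertion("risks.no-todo", "remove TODO placeholders",
              lambda text: "TODO" not in text),
]
section_text = "- Vendor lock-in\n- TODO: describe schedule risk\n"
failures = run_assertions(section_text, risks_assertions)
for rule_id, message in failures:
    print(rule_id, "->", message)
```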
UC-003: Size and Complexity Bands
Define soft/hard bands for document and section size: words, characters, sentences, paragraphs, sections, list items, code blocks, and nesting depth.
Practical value:
- Controls generation output size.
- Keeps templates from becoming bloated or underdeveloped.
- Helps compare intended vs actual document complexity.
Needed tooling:
- Metrics extractor.
- Rule severities: info, warning, error.
- “Too small/too large” diagnostics with actual and target values.
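A metrics extractor with bands could be sketched as follows. The counting heuristics are intentionally crude (whitespace-separated words, punctuation-terminated sentences), and `Band` and `check_bands` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Optional


def extract_metrics(text: str) -> dict[str, int]:
    return {
        "words": len(text.split()),
        "sentences": len(re.findall(r"[.!?]+(?:\s|$)", text)),
        "paragraphs": len([p for p in re.split(r"\n\s*\n", text) if p.strip()]),
    }


@dataclass
class Band:
    metric: str
    severity: str                  # "info", "warning", or "error"
    minimum: Optional[int] = None
    maximum: Optional[int] = None


def check_bands(metrics: dict[str, int], bands: list[Band]) -> list[str]:
    findings = []
    for band in bands:
        actual = metrics[band.metric]
        if band.minimum is not None and actual < band.minimum:
            findings.append(f"{band.severity}: {band.metric} too small "
                            f"(actual {actual}, minimum {band.minimum})")
        elif band.maximum is not None and actual > band.maximum:
            findings.append(f"{band.severity}: {band.metric} too large "
                            f"(actual {actual}, maximum {band.maximum})")
    return findings


text = "One short sentence. Another one!\n\nSecond paragraph here."
bands = [
    Band("words", "warning", minimum=20),                # soft lower bound
    Band("paragraphs", "error", maximum=1),              # hard upper bound
    Band("sentences", "info", minimum=1, maximum=10),
]
report = check_bands(extract_metrics(text), bands)
print("\n".join(report))
```

Expressing soft and hard bands through severity keeps one rule shape for both; a warning band nudges, an error band blocks.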
UC-004: Form-Backed Markdown Generation
Define forms that collect or prefill structured fields, then render markdown documents. Fields may be static, calculated, conditional, or context-derived.
Practical value:
- Bridges structured data capture and prose generation.
- Supports repeatable business documents.
- Makes prefill from user/project/entity data explicit.
Needed tooling:
- Field schema.
- UI schema or form hints.
- Dynamic rules for requiredness, visibility, defaults, and calculations.
- Template rendering with validation before and after render.
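Rendering with validation on both sides might be sketched like this, using the standard library's `string.Template`. The field names, the letter text, and the post-render check are assumptions chosen for illustration.

```python
from string import Template

# Hypothetical contract: field names and template text are illustrative only.
REQUIRED_FIELDS = ("customer", "invoice_id", "amount", "due_date")
LETTER = Template(
    "# Payment Reminder\n\n"
    "Dear $customer,\n\n"
    "Invoice $invoice_id for $amount is due on $due_date.\n"
)


def render_letter(fields: dict) -> str:
    # Validate before render: every required field must have a value.
    missing = [name for name in REQUIRED_FIELDS if not fields.get(name)]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    text = LETTER.substitute(fields)
    # Validate after render: the output must still look like the document type.
    if not text.startswith("# "):
        raise ValueError("rendered document does not start with a title heading")
    return text


letter = render_letter({
    "customer": "Ada Lovelace",
    "invoice_id": "INV-7",
    "amount": "120.00 EUR",
    "due_date": "2026-06-01",
})
print(letter)
```

In a fuller version the post-render step would run the same contract checks used for hand-written documents, so generated and authored text are held to one standard.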
UC-005: Context-Aware Validation
Validate a document against external context: user data, project metadata, related entities, dates, policy constraints, or canonical terminology.
Practical value:
- Checks whether a document is correct for this case, not only generally well-formed.
- Enables pipelines like personalized letters, compliance reports, and project-specific workplans.
Needed tooling:
- Context object schema.
- Resolvers for local files, JSON/YAML data, and later higher-layer systems.
- Rule expressions that can reference document and context paths.
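One simple way to let rules reference both document and context is a dotted-path resolver over a combined scope. The sketch below assumes such a scope dictionary; `resolve`, `check_equal`, and the path names are hypothetical.

```python
from typing import Any


def resolve(root: dict, path: str) -> Any:
    """Follow a dotted path such as 'context.project.owner'; None if absent."""
    current: Any = root
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def check_equal(scope: dict, rule_id: str, left: str, right: str) -> list[str]:
    """Deterministic rule: two paths must resolve to the same value."""
    a, b = resolve(scope, left), resolve(scope, right)
    if a is None or b is None:
        unresolved = left if a is None else right
        return [f"{rule_id}: cannot resolve {unresolved}"]
    if a != b:
        return [f"{rule_id}: {left}={a!r} does not match {right}={b!r}"]
    return []


# Document data and external context share one explicit scope.
scope = {
    "document": {"frontmatter": {"owner": "alice", "project": "apollo"}},
    "context": {"project": {"id": "apollo", "owner": "bob"}},
}
owner_findings = check_equal(scope, "ctx.owner",
                             "document.frontmatter.owner", "context.project.owner")
project_findings = check_equal(scope, "ctx.project",
                               "document.frontmatter.project", "context.project.id")
print(owner_findings, project_findings)
```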
UC-006: LLM-Assisted Section Assessment
Attach rubrics to section types. Use an external LLM adapter to assess whether a section satisfies the rubric, returning score, reason, and pass/fail.
Practical value:
- Handles semantic checks that deterministic rules cannot.
- Supports review loops for generated text.
- Makes subjective requirements explicit and auditable.
Needed tooling:
- Rubric declaration format.
- Provider-neutral assessment request/response models.
- Caching and reproducibility metadata.
- Clear distinction between deterministic errors and model-judged findings.
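The request/response models and caching could be sketched as below. Nothing here calls a real model: `KeywordStubAssessor` is a stand-in for an external adapter, and all class names and fields are assumptions.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Protocol


@dataclass(frozen=True)
class AssessmentRequest:
    section_type: str
    rubric: str
    section_text: str

    def cache_key(self) -> str:
        # Identical requests hash to the same key, which supports reproducibility.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


@dataclass(frozen=True)
class AssessmentResult:
    score: float     # 0.0 to 1.0
    passed: bool
    reason: str
    judged_by: str   # keeps model-judged findings apart from deterministic errors


class Assessor(Protocol):
    def assess(self, request: AssessmentRequest) -> AssessmentResult: ...


class KeywordStubAssessor:
    """Stand-in for an external LLM adapter; performs no model call."""

    def assess(self, request: AssessmentRequest) -> AssessmentResult:
        hit = "because" in request.section_text.lower()
        reason = "states a rationale" if hit else "no rationale found"
        return AssessmentResult(1.0 if hit else 0.0, hit, reason, "stub")


class CachingAssessor:
    def __init__(self, inner: Assessor) -> None:
        self.inner = inner
        self.cache: dict[str, AssessmentResult] = {}
        self.calls = 0  # number of times the inner assessor was actually invoked

    def assess(self, request: AssessmentRequest) -> AssessmentResult:
        key = request.cache_key()
        if key not in self.cache:
            self.calls += 1
            self.cache[key] = self.inner.assess(request)
        return self.cache[key]


assessor = CachingAssessor(KeywordStubAssessor())
request = AssessmentRequest("Decision", "The section explains why the option was chosen.",
                            "We chose SQLite because the data set is small.")
first = assessor.assess(request)
second = assessor.assess(request)
print(first.passed, first.reason, assessor.calls)
```

Keeping the adapter behind a `Protocol` means the contract layer never imports a provider SDK; swapping providers only changes which object is passed to `CachingAssessor`.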
UC-007: Pipeline Diagnostics and Repair Guidance
Run a document pipeline and get one coherent diagnostic report from parsing, schema checks, field validation, assertions, generation, composition, and LLM-assisted assessments.
Practical value:
- Makes failures debuggable.
- Helps humans and agents repair documents.
- Avoids scattered errors from unrelated subsystems.
Needed tooling:
- Common diagnostic model.
- Error codes and severities.
- Source spans and rule ids.
- Suggested repair text or structured patches when safe.
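A common diagnostic model that merges findings from several stages into one report might look like this sketch. The field set follows the Diagnostic entry in the terminology table; the class name, codes, and sort order are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

SEVERITY_ORDER = {"error": 0, "warning": 1, "info": 2}


@dataclass
class Diagnostic:
    severity: str                     # "error", "warning", or "info"
    code: str
    message: str
    rule_id: str
    line: Optional[int] = None        # source location, when known
    suggestion: Optional[str] = None  # suggested repair, when safe


def build_report(*sources: list[Diagnostic]) -> list[str]:
    """Merge diagnostics from any number of stages into one ordered report."""
    merged = [d for source in sources for d in source]
    merged.sort(key=lambda d: (SEVERITY_ORDER[d.severity], d.line or 0))
    lines = []
    for d in merged:
        where = f"line {d.line}" if d.line is not None else "document"
        text = f"{d.severity.upper()} {d.code} [{d.rule_id}] {where}: {d.message}"
        if d.suggestion:
            text += f" (suggested repair: {d.suggestion})"
        lines.append(text)
    return lines


# Hypothetical findings from two unrelated stages of one pipeline run.
structure = [Diagnostic("warning", "MKT201", "section is shorter than its band",
                        "band.words", line=12)]
assessment = [Diagnostic("error", "MKT301", "Decision section gives no rationale",
                         "rubric.decision", line=20,
                         suggestion="add one sentence explaining the choice")]
report_lines = build_report(structure, assessment)
print("\n".join(report_lines))
```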
Comparison With markitect-main
markitect-main had several useful seeds:
- `x-markitect-sections` for required/recommended/optional/discouraged/improper sections.
- `x-markitect-content-control` for required, discouraged, and forbidden patterns plus word-count metrics.
- Section and content validators with warnings/errors.
- Schema generation and validation experiments.
- Draft generation with `x-markitect-field-mapping`.
- Prompt quality gates with schema and pattern validators.
- Infospace entity parsing and LLM classification/evaluation.
The problem was not lack of ideas. The problem was that the ideas lived in separate subsystems with different models:
- Schema validation compared generated schemas rather than validating a stable document contract.
- Semantic validation used `x-markitect-*` extensions but was not integrated into a unified contract framework.
- Field mapping existed in draft generation, not in a general form/context model.
- LLM quality gates existed inside prompt execution, not as provider-neutral document assessments.
- Infospace checks were domain/application layer behavior, not syntax-layer primitives.
Strategic Direction
The successor should introduce a framework layer above parsing:
Markdown parse model
-> document contract
-> section specifications
-> field/form specifications
-> deterministic rules/assertions
-> metric bands
-> optional LLM rubrics
-> unified diagnostics
This should not replace JSON Schema. JSON Schema remains useful for typed data and machine validation. The new layer should make document-specific semantics natural.
Recommendation
Do not continue straight into generic query/transform work until this framework direction is captured. The next implementation slice should be a small, deterministic version of document contracts:
- Define the contract schema and terminology.
- Implement section specifications.
- Implement metric bands.
- Implement the unified diagnostic model.
- Leave LLM rubrics and form dynamics as designed extension points for the next slice.
This is the utility inflection point. It will make markitect-tool practically
useful instead of merely structurally correct.