Naming Convention Updates:
- Renamed history/2026-01-06-semantic-document-validation → history/260106-semantic-document-validation
- Documented yymmdd- format convention in history/README.md and roadmap/README.md
- Updated all date references in WORKPLAN.md and DONE.md
- Fixed SCHEMA_MANAGEMENT_GUIDE.md references to use yymmdd- format

Convention Details:
- Format: yymmdd-topic-name (e.g., 260106-semantic-document-validation)
- Benefits: Concise while maintaining chronological sorting
- Examples documented in both README files
- Applies to both roadmap/ and history/ directories

This establishes a consistent timestamp prefix convention that Claude and its agents should follow.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Plan: Schema System Enhancement - Semantic Document Validation
Overview
The schema management system has complete schema structure analysis tools (schema-analyze, schema-refine) and structural AST validation (markitect validate), but is missing semantic validation capabilities. This plan enhances validation to check sections, content patterns, and quality metrics defined in x-markitect extensions.
Current State Assessment
✅ Already Implemented
- schema-analyze: Detects rigid constraints, calculates rigidity score (markitect/schema_analyzer.py)
- schema-refine: Automatically loosens rigid constraints (markitect/schema_refiner.py)
- markitect validate: Validates AST structure against JSON schemas (cli.py:1493-1600)
  - Checks headings, paragraphs, code_blocks counts match schema
  - Validates document structure against JSON Schema properties
  - Does NOT check x-markitect-sections classifications
  - Does NOT validate x-markitect-content-control patterns
- X-Markitect Extensions: Full system with sections, content-control, metadata
- Metaschema Validation: Validates schema structure and extensions
- 4 Production Schemas: manpage, API docs, terminology, schema-schema
- Comprehensive Documentation: User guides, specifications, tests (97 tests passing)
❌ Missing Capabilities (Semantic Validation)
- Section Classification Enforcement: required/recommended/optional/discouraged/improper not checked
- Content Pattern Validation: required_patterns, forbidden_patterns not matched
- Quality Metrics Validation: min_words, max_words, min_sentences not enforced
- Link Validation: Internal/external link checking not implemented
- Content Instructions: content_instruction fields defined but not validated
What We Have vs What We Need
Current markitect validate (Structural):
markitect validate doc.md --schema schema.json
# ✅ Checks: headings.level_2 has 5-30 items
# ✅ Checks: paragraphs has 10-500 items
# ✅ Checks: code_blocks has 1-50 items
# ❌ Does NOT check: SYNOPSIS section present (required)
# ❌ Does NOT check: INTERNAL_NOTES absent (improper)
# ❌ Does NOT check: Synopsis contains bold command name
# ❌ Does NOT check: Description has min 50 words
Enhanced markitect validate (Structural + Semantic):
markitect validate doc.md --schema manpage-schema-v1.0.md
# ✅ Checks: AST structure (existing)
# ✅ NEW: SYNOPSIS section present (required)
# ✅ NEW: INTERNAL_NOTES not present (improper)
# ✅ NEW: Synopsis contains **command** pattern
# ✅ NEW: Description has 50+ words
# ✅ NEW: No forbidden TODO patterns
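For orientation, the extension blocks that these new checks read might look roughly like the sketch below. The key names (classification, error_message, warning_if_missing, required_patterns, forbidden_patterns, content_quality, min_words) are taken from the validator sketches later in this plan; the concrete section names, regexes, and thresholds are illustrative assumptions, not copied from the real manpage schema.

```python
import json

# Illustrative x-markitect extension block. All values are assumptions
# chosen to mirror the SYNOPSIS / INTERNAL_NOTES examples in this plan.
schema_extensions = {
    "x-markitect-sections": {
        "SYNOPSIS": {
            "classification": "required",
            "error_message": "SYNOPSIS is required",
        },
        "EXAMPLES": {
            "classification": "recommended",
            "warning_if_missing": "EXAMPLES is recommended",
        },
        "INTERNAL_NOTES": {
            "classification": "improper",
            "error_message": "INTERNAL_NOTES must not appear",
        },
    },
    "x-markitect-content-control": {
        "synopsis": {
            "required_patterns": [r"\*\*[\w-]+\*\*"],  # bold command name
            "forbidden_patterns": [r"\bTODO\b"],
        },
        "description": {"content_quality": {"min_words": 50}},
    },
}

print(json.dumps(schema_extensions, indent=2))
```

Content-control keys are written in lower case here because the ContentValidator sketch in Phase 3 upper-cases them before looking up the section.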
Implementation Plan
Phase 1: Core Semantic Validator
Goal: Create semantic validator to complement existing structural validation
New Module: markitect/semantic_validator.py
Key Components:
class SemanticValidator:
    """Validates markdown documents against x-markitect extensions.

    Complements existing SchemaValidator which handles structural AST validation.
    This validator checks semantic aspects defined in x-markitect-* extensions.
    """

    def __init__(self, schema_path: str):
        # Load schema (supports .md schemas with embedded JSON)
        self.schema = load_schema_with_extensions(schema_path)
        # Initialize sub-validators
        self.section_validator = SectionValidator(self.schema)
        self.content_validator = ContentValidator(self.schema)
        self.link_validator = LinkValidator(self.schema)

    def validate(self, document_path: str, check_links: bool = False) -> SemanticValidationReport:
        """Main semantic validation entry point."""
        doc = parse_markdown_document(document_path)
        results = {
            'sections': self.section_validator.check(doc),
            'content': self.content_validator.check(doc),
        }
        if check_links:
            results['links'] = self.link_validator.check(doc)
        return SemanticValidationReport(results)
Features:
- Load schema from registry or filesystem
- Parse markdown document into AST
- Validate sections against x-markitect-sections classifications
- Check content against x-markitect-content-control patterns
- Validate links if enabled
- Generate detailed report with line numbers
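The "load schema" step above could be sketched as follows. This is a minimal illustration that assumes a markdown schema carries its JSON in the first fenced json block; the helper names extract_embedded_schema and load_schema are hypothetical and stand in for whatever load_schema_with_extensions does in the real code base.

```python
import json
import re
from pathlib import Path

# The fence marker is built programmatically so this example nests
# cleanly inside a markdown document.
FENCE = "`" * 3
JSON_BLOCK = re.compile(FENCE + r"json\s*\n(.*?)\n" + FENCE, re.DOTALL)


def extract_embedded_schema(markdown_text: str) -> dict:
    """Return the first fenced json block of a markdown schema as a dict."""
    match = JSON_BLOCK.search(markdown_text)
    if match is None:
        raise ValueError("no fenced json block found in schema document")
    return json.loads(match.group(1))


def load_schema(path: str) -> dict:
    """Load a plain .json schema, or a .md schema with embedded JSON."""
    text = Path(path).read_text(encoding="utf-8")
    if path.endswith(".md"):
        return extract_embedded_schema(text)
    return json.loads(text)


# Usage with an in-memory markdown schema (assumed layout).
sample = "\n".join([
    "# Manpage Schema",
    FENCE + "json",
    '{"x-markitect-sections": {"SYNOPSIS": {"classification": "required"}}}',
    FENCE,
])
schema = extract_embedded_schema(sample)
print(schema["x-markitect-sections"]["SYNOPSIS"]["classification"])
```

Raising ValueError when no block is found keeps a malformed schema from silently validating as "no rules".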
Phase 2: Section Presence Validator
New Module: markitect/validators/section_validator.py
Validation Rules:
class SectionValidator:
    """Validates section presence and classification compliance."""

    def __init__(self, schema: dict):
        # SemanticValidator passes the loaded schema to each sub-validator
        self.schema = schema

    def check(self, document: MarkdownDocument) -> SectionValidationResult:
        sections_spec = self.schema.get('x-markitect-sections', {})
        doc_sections = document.get_headings_by_level(2)
        issues = []

        # Check REQUIRED sections
        for section_name, spec in sections_spec.items():
            if spec['classification'] == 'required':
                if section_name not in doc_sections:
                    issues.append(SectionMissing(
                        section=section_name,
                        severity='ERROR',
                        message=spec.get('error_message', f'{section_name} is required')
                    ))

        # Check IMPROPER sections (must not exist)
        for section_name, spec in sections_spec.items():
            if spec['classification'] == 'improper':
                if section_name in doc_sections:
                    issues.append(SectionImproper(
                        section=section_name,
                        severity='ERROR',
                        message=spec.get('error_message', f'{section_name} must not appear')
                    ))

        # Check RECOMMENDED sections (warnings)
        for section_name, spec in sections_spec.items():
            if spec['classification'] == 'recommended':
                if section_name not in doc_sections:
                    issues.append(SectionMissing(
                        section=section_name,
                        severity='WARNING',
                        message=spec.get('warning_if_missing', f'{section_name} is recommended')
                    ))

        return SectionValidationResult(issues)
Section Classification Enforcement:
- REQUIRED → ERROR if missing
- RECOMMENDED → WARNING if missing
- OPTIONAL → No check
- DISCOURAGED → WARNING if present
- IMPROPER → ERROR if present
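The five enforcement rules above can also be expressed as a single lookup table, which covers DISCOURAGED in the same pass. The following is a self-contained sketch, not the project's implementation: SectionIssue, RULES, and check_sections are hypothetical names, and the generated messages are placeholders.

```python
from dataclasses import dataclass

# classification -> (document state that triggers an issue, severity)
RULES = {
    "required": ("missing", "ERROR"),
    "recommended": ("missing", "WARNING"),
    "optional": (None, None),          # never triggers
    "discouraged": ("present", "WARNING"),
    "improper": ("present", "ERROR"),
}


@dataclass
class SectionIssue:
    section: str
    severity: str
    message: str


def check_sections(sections_spec: dict, doc_sections: set) -> list:
    """Compare the schema's section classifications with the headings found."""
    issues = []
    for name, spec in sections_spec.items():
        classification = spec["classification"]
        trigger, severity = RULES[classification]
        state = "present" if name in doc_sections else "missing"
        if trigger == state:
            issues.append(SectionIssue(name, severity, f"{name} is {state} ({classification})"))
    return issues


spec = {
    "SYNOPSIS": {"classification": "required"},
    "EXAMPLES": {"classification": "recommended"},
    "DRAFT": {"classification": "discouraged"},
    "INTERNAL_NOTES": {"classification": "improper"},
}
for issue in check_sections(spec, {"SYNOPSIS", "INTERNAL_NOTES"}):
    print(issue.severity, issue.message)
```

A table keeps the classification semantics in one place, so adding a new classification would not require another loop over the schema.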
Phase 3: Content Pattern Validator
New Module: markitect/validators/content_validator.py
Pattern Matching:
import re


class ContentValidator:
    """Validates content against x-markitect-content-control rules."""

    def __init__(self, schema: dict):
        self.schema = schema

    def check(self, document: MarkdownDocument) -> ContentValidationResult:
        content_rules = self.schema.get('x-markitect-content-control', {})
        issues = []
        for section_key, rules in content_rules.items():
            section = document.get_section(section_key.upper())
            if not section:
                continue  # Section validator handles missing sections

            # Check required patterns
            for pattern in rules.get('required_patterns', []):
                if not re.search(pattern, section.content):
                    issues.append(PatternMissing(
                        section=section.name,
                        pattern=pattern,
                        severity='ERROR'
                    ))

            # Check forbidden patterns (keep the match so the offending text can be reported)
            for pattern in rules.get('forbidden_patterns', []):
                match = re.search(pattern, section.content)
                if match:
                    issues.append(ForbiddenPattern(
                        section=section.name,
                        pattern=pattern,
                        severity='ERROR',
                        matched_text=match.group(0)
                    ))

            # Check content quality
            quality = rules.get('content_quality', {})
            word_count = len(section.content.split())
            if 'min_words' in quality and word_count < quality['min_words']:
                issues.append(ContentTooShort(
                    section=section.name,
                    actual=word_count,
                    required=quality['min_words'],
                    severity='WARNING'
                ))
            if 'max_words' in quality and word_count > quality['max_words']:
                issues.append(ContentTooLong(
                    section=section.name,
                    actual=word_count,
                    limit=quality['max_words'],
                    severity='WARNING'
                ))

        return ContentValidationResult(issues)
Content Rules Checked:
- Required patterns (regex matches)
- Discouraged patterns (warnings)
- Forbidden patterns (errors)
- Word count ranges (min/max)
- Sentence counts (if specified)
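The word and sentence thresholds in this list could be checked along the following lines. This is a rough sketch: the sentence counter is deliberately naive (abbreviations such as "e.g." would be miscounted), and count_sentences and check_quality are hypothetical helper names rather than part of the planned module.

```python
import re

# A sentence ends at a run of ., ! or ? followed by whitespace or end of text.
SENTENCE_END = re.compile(r"[.!?]+(?:\s+|$)")


def count_sentences(text: str) -> int:
    """Naive sentence count based on terminal punctuation."""
    return len(SENTENCE_END.findall(text.strip()))


def check_quality(text: str, quality: dict) -> list:
    """Return (severity, message) tuples for each violated threshold."""
    issues = []
    words = len(text.split())
    sentences = count_sentences(text)
    if "min_words" in quality and words < quality["min_words"]:
        issues.append(("WARNING", f"too short: {words} words, minimum {quality['min_words']}"))
    if "max_words" in quality and words > quality["max_words"]:
        issues.append(("WARNING", f"too long: {words} words, maximum {quality['max_words']}"))
    if "min_sentences" in quality and sentences < quality["min_sentences"]:
        issues.append(("WARNING", f"too few sentences: {sentences}, minimum {quality['min_sentences']}"))
    return issues


text = "List files. Show details! Done?"
print(count_sentences(text))
print(check_quality(text, {"min_words": 10, "min_sentences": 2}))
```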
Phase 4: Link Validator
New Module: markitect/validators/link_validator.py
Link Checking:
class LinkValidator:
    """Validates links according to x-markitect-content-control.link_validation."""

    def __init__(self, schema: dict):
        self.schema = schema

    def check(self, document: MarkdownDocument) -> LinkValidationResult:
        link_config = self.schema.get('x-markitect-content-control', {}).get('link_validation', {})
        if not any(link_config.values()):
            return LinkValidationResult([])  # No link validation configured

        links = document.extract_links()
        issues = []
        for link in links:
            # Check internal links
            if link.is_internal() and link_config.get('check_internal', False):
                target = document.resolve_internal_link(link.target)
                if not target:
                    issues.append(BrokenInternalLink(
                        link=link.target,
                        line=link.line_number,
                        severity='ERROR'
                    ))

            # Check external links
            if link.is_external() and link_config.get('check_external', False):
                # HTTP HEAD request with timeout
                if not self._check_url_exists(link.target):
                    issues.append(BrokenExternalLink(
                        link=link.target,
                        line=link.line_number,
                        severity='WARNING'  # External links are warnings
                    ))

            # Check fragments
            if link.has_fragment() and not link_config.get('allow_fragments', True):
                issues.append(FragmentNotAllowed(
                    link=link.target,
                    line=link.line_number,
                    severity='WARNING'
                ))

        return LinkValidationResult(issues)
Link Types Validated:
- Internal links (to other sections/documents)
- External links (HTTP/HTTPS URLs)
- Fragment identifiers (#section-name)
- Email links (mailto:)
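One way to tell these four link types apart is to inspect the parsed URL, as in the sketch below. It uses only the standard library; classify_link and has_fragment are hypothetical helpers, and the rule that anything without an http, https, or mailto scheme counts as internal is an assumption.

```python
from urllib.parse import urlsplit


def classify_link(target: str) -> str:
    """Classify a markdown link target as external, email, fragment, or internal."""
    parts = urlsplit(target)
    if parts.scheme in ("http", "https"):
        return "external"
    if parts.scheme == "mailto":
        return "email"
    if target.startswith("#"):
        return "fragment"  # same-document anchor such as #section-name
    return "internal"      # relative path to another document


def has_fragment(target: str) -> bool:
    """True if the target carries a #fragment, whatever its link type."""
    return bool(urlsplit(target).fragment)


for target in ("https://example.com/page", "mailto:dev@example.com",
               "#synopsis", "docs/guide.md#install"):
    print(target, "->", classify_link(target), has_fragment(target))
```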
Phase 5: CLI Integration
Enhance Existing Command: markitect validate (cli.py:1493-1600)
New Options to Add:
@cli.command('validate')
@click.argument('file_path', type=click.Path(exists=True, path_type=Path))
@click.option('--schema', '-s', type=click.Path(exists=True, path_type=Path),
              help='Path to JSON schema file')
@click.option('--schema-json', type=str,
              help='JSON schema provided as a string')
@click.option('--quiet', '-q', is_flag=True,
              help='Only output validation result (true/false)')
@click.option('--detailed-errors', '--errors', is_flag=True,
              help='Show detailed validation errors (Issue #8)')
@click.option('--error-format', type=click.Choice(['text', 'json', 'markdown']), default='text',
              help='Format for detailed error output')
# NEW OPTIONS:
@click.option('--semantic/--no-semantic', default=True,
              help='Enable/disable semantic validation (sections, patterns, quality)')
@click.option('--check-links', is_flag=True,
              help='Enable link validation (may be slow)')
@click.option('--strict', is_flag=True,
              help='Treat warnings as errors')
@pass_config
def validate(config, file_path, schema, schema_json, quiet, detailed_errors, error_format,
             semantic, check_links, strict):
    """
    Validate a markdown file against a JSON schema.

    ENHANCED: Now includes semantic validation of x-markitect extensions:
    - Section classifications (required, recommended, optional, discouraged, improper)
    - Content patterns (required_patterns, forbidden_patterns)
    - Quality metrics (min_words, max_words, min_sentences)
    - Link validation (internal/external)

    Examples:
        # Structural + semantic validation (default)
        markitect validate doc.md --schema manpage-schema-v1.0.md

        # Only structural validation (classic mode)
        markitect validate doc.md --schema schema.json --no-semantic

        # With link checking
        markitect validate doc.md --schema 1 --check-links

        # Strict mode (warnings become errors)
        markitect validate doc.md --schema manpage-schema-v1.0.md --strict
    """
    # Existing structural validation code...
    # (Keep all existing logic for SchemaValidator)

    # NEW: Add semantic validation if enabled and schema has x-markitect extensions
    if semantic:
        # The option is named `schema`; SemanticValidator expects string paths
        semantic_validator = SemanticValidator(str(schema))
        semantic_report = semantic_validator.validate(str(file_path), check_links=check_links)

        # Combine structural and semantic results
        combined_report = CombinedValidationReport(structural_result, semantic_report)

        # Output combined results
        if not quiet:
            click.echo(combined_report.format(error_format))

        # Exit codes
        if combined_report.has_errors():
            sys.exit(1)
        elif strict and combined_report.has_warnings():
            sys.exit(1)
Integration Strategy:
- Keep existing structural validation (SchemaValidator) unchanged
- Add new semantic validation layer on top
- Use --no-semantic flag to disable new validation (backward compatibility)
- Combine structural + semantic results in unified report
- Default to semantic=True for new markdown schemas with extensions
Output Format (text):
Validating: my-command.1.md
Schema: manpage-schema-v1.0.md (v1.0.0)
Section Validation:
✅ SYNOPSIS - Present (required)
✅ DESCRIPTION - Present (required)
⚠️ EXAMPLES - Missing (recommended)
❌ INTERNAL_NOTES - Must not appear (improper)
Content Validation:
✅ SYNOPSIS - Patterns matched
⚠️ DESCRIPTION - Too short (35 words, minimum 50)
❌ SYNOPSIS - Forbidden pattern found: "TODO"
Link Validation: (skipped - use --check-links)
Summary:
Errors: 2
Warnings: 2
Status: FAILED ❌
Failed validations:
Line 12: INTERNAL_NOTES section must not appear in published manpages
Line 5: SYNOPSIS contains forbidden pattern "TODO"
Phase 6: Batch Document Validation
New Command: markitect validate-batch
@cli.command('validate-batch')
@click.argument('directory', type=click.Path(exists=True, file_okay=False))
@click.option('--schema', '-s', type=str, required=True)
@click.option('--pattern', default='*.md', help='File pattern to match')
@click.option('--strict', is_flag=True)
@click.option('--summary-only', is_flag=True, help='Show only summary table')
@pass_config
def validate_batch_cmd(config, directory, schema, pattern, strict, summary_only):
    """Validate multiple documents in a directory.

    Example:
        markitect validate-batch docs/manpages/ --schema manpage-schema-v1.0.md
    """
    # Find all matching documents
    docs = list(Path(directory).glob(pattern))

    # Load the schema once and reuse the validator for every document
    validator = SemanticValidator(schema)

    # Validate each
    results = []
    for doc in docs:
        report = validator.validate(str(doc))
        results.append((doc.name, report))

    # Show summary table
    display_batch_results(results)
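The summary table mentioned in the batch command could be rendered roughly as follows. This sketch assumes each result has already been reduced to a (filename, error_count, warning_count) tuple; format_batch_summary is a hypothetical stand-in for display_batch_results, whose real signature this plan does not specify.

```python
def format_batch_summary(results: list) -> str:
    """Render (filename, errors, warnings) tuples as a plain-text table."""
    width = max([len(name) for name, _, _ in results] + [len("Document")])
    lines = [f"{'Document':<{width}}  Errors  Warnings  Status"]
    for name, errors, warnings in results:
        status = "FAILED" if errors else "PASSED"
        lines.append(f"{name:<{width}}  {errors:>6}  {warnings:>8}  {status}")
    failed = sum(1 for _, errors, _ in results if errors)
    lines.append(f"{len(results) - failed}/{len(results)} documents passed")
    return "\n".join(lines)


print(format_batch_summary([("ls.1.md", 0, 1), ("cp.1.md", 2, 0)]))
```

In strict mode a caller would presumably also count documents with warnings as failed; that variation is left out here.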
Implementation Phases
Phase 1 (Core - 1 session)
- SemanticValidator class
- Basic section validation
- CLI validate command
- Simple text output format
Phase 2 (Content - 1 session)
- ContentValidator with pattern matching
- Word count validation
- Quality metrics checking
- Enhanced reporting
Phase 3 (Links - 1 session)
- LinkValidator with internal link checking
- Optional external link validation
- Fragment validation
- Performance optimization (caching)
Phase 4 (Polish - 1 session)
- Batch validation support
- JSON/table output formats
- Integration tests
- Documentation updates
Critical Files
New Files:
- markitect/semantic_validator.py - Main semantic validator (complements existing SchemaValidator)
- markitect/validators/section_validator.py - Section classification enforcement
- markitect/validators/content_validator.py - Content pattern matching and quality
- markitect/validators/link_validator.py - Link validation
- markitect/validators/__init__.py - Validators package
- tests/test_semantic_validator.py - Semantic validator tests
- tests/validators/test_section_validator.py - Section validator tests
- tests/validators/test_content_validator.py - Content validator tests
- tests/validators/test_link_validator.py - Link validator tests
Modified Files:
- markitect/cli.py (lines 1493-1600) - Enhance validate command with semantic validation
- markitect/schema_loader.py - May need utility to extract x-markitect extensions
- docs/SCHEMA_MANAGEMENT_GUIDE.md - Add semantic validation section
- examples/manpages/README.md - Add validation examples
- examples/terminology/README.md - Add validation examples
Reference Files (unchanged, used for integration):
- markitect/validator.py - Existing SchemaValidator for structural validation
- markitect/schema_analyzer.py - Reference for schema extension parsing
Design Decisions
1. Markdown Parsing
Decision: Use existing markdown parser from markitect core
Rationale: Already handles frontmatter, sections, AST generation
2. Link Validation Default
Decision: Internal links checked by default, external links opt-in
Rationale: External link checking is slow (network requests), internal is fast
3. Severity Levels
Decision: ERROR (required violations), WARNING (recommended violations), INFO (suggestions)
Rationale: Matches schema classification system semantics
4. Exit Codes
Decision: 0=success, 1=validation failed, 2=system error
Rationale: Standard CLI conventions for CI/CD integration
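The exit-code decision, combined with the --strict flag from Phase 5, can be sketched as a small function. The names below are hypothetical, and treating OSError and ValueError as "system errors" is an assumption made for the example.

```python
EXIT_OK = 0
EXIT_VALIDATION_FAILED = 1
EXIT_SYSTEM_ERROR = 2


def run_validation(validate, strict: bool = False) -> int:
    """Map a validation outcome to a process exit code.

    `validate` is any callable returning (error_count, warning_count).
    """
    try:
        errors, warnings = validate()
    except (OSError, ValueError):
        # e.g. unreadable document or unparsable schema (assumed mapping)
        return EXIT_SYSTEM_ERROR
    if errors > 0:
        return EXIT_VALIDATION_FAILED
    if strict and warnings > 0:
        return EXIT_VALIDATION_FAILED
    return EXIT_OK


print(run_validation(lambda: (0, 2)))
print(run_validation(lambda: (0, 2), strict=True))
```

Keeping code 2 separate from code 1 lets a CI job distinguish "the document is invalid" from "the validator could not run".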
5. Pattern Syntax
Decision: Use Python regex patterns directly
Rationale: Schemas already use regex strings, no need for new syntax
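Because schema patterns are handed to Python's re module unchanged, a schema author's typo surfaces as re.error at validation time. A possible safeguard, sketched here with a hypothetical compile_patterns helper and illustrative patterns, is to compile every pattern up front and report the invalid ones separately.

```python
import re


def compile_patterns(patterns: list) -> tuple:
    """Compile schema regex strings; return (compiled, invalid) lists."""
    compiled, invalid = [], []
    for raw in patterns:
        try:
            compiled.append(re.compile(raw))
        except re.error as exc:
            # Keep the raw pattern and the parser's message for the report
            invalid.append((raw, str(exc)))
    return compiled, invalid


compiled, invalid = compile_patterns([
    r"\*\*[\w-]+\*\*",  # bold command name, e.g. **ls**
    r"\bTODO\b",        # leftover TODO marker
    r"(unclosed",       # deliberately malformed
])
print(len(compiled), "valid;", [raw for raw, _ in invalid], "invalid")
```

Compiling once also avoids re-parsing the same pattern for every section of every document in a batch run.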
Testing Strategy
Unit Tests
- SectionValidator: Test all classification types
- ContentValidator: Test pattern matching, word counts
- LinkValidator: Test internal/external link checking
- ValidationReport: Test formatting and aggregation
Integration Tests
- Validate real manpage documents against manpage schema
- Validate terminology documents against terminology schema
- Test batch validation across multiple documents
- Test CLI output formats
Edge Cases
- Documents with no schema sections defined
- Schemas with no content-control rules
- Empty documents
- Documents with malformed links
- Unicode in patterns and content
User Workflows
Workflow 1: Validate Single Document
# Validate a manpage
markitect validate my-command.1.md --schema manpage-schema-v1.0.md
# With link checking
markitect validate my-command.1.md --schema 1 --check-links
Workflow 2: CI/CD Integration
#!/bin/bash
# Validate all manpages in CI
if ! markitect validate-batch docs/man/ --schema 1 --strict; then
    echo "Manpage validation failed!"
    exit 1
fi
Workflow 3: Pre-commit Hook
#!/bin/bash
# .git/hooks/pre-commit
files=$(git diff --cached --name-only --diff-filter=ACM | grep '\.1\.md$')
for file in $files; do
    if ! markitect validate "$file" --schema manpage-schema-v1.0.md; then
        echo "Fix validation errors before committing"
        exit 1
    fi
done
Workflow 4: Interactive Editing
# Validate while editing
watch -n 2 'markitect validate draft.md --schema api-documentation-schema-v1.0.md'
Success Metrics
- Core Functionality: Can validate documents against all 4 production schemas
- Classification Enforcement: Required/improper sections properly checked
- Pattern Matching: Content patterns validated with regex
- Performance: Validate 100 documents in < 5 seconds (without link checking)
- Test Coverage: > 90% coverage for new validator modules
- Documentation: Complete examples for each schema type
Future Enhancements (Out of Scope)
- Auto-fixing document validation errors
- Suggestion engine for missing content
- Readability scoring with specific algorithms
- Image validation (size, format, accessibility)
- Schema evolution analysis (breaking changes between versions)
- Document-to-schema generation (inverse of current flow)
✅ COMPLETION SUMMARY
Date Completed: 260106 (2026-01-06)
Status: All 6 phases completed successfully
Implementation Results
Phases Completed:
- ✅ Phase 1: Core Semantic Validator & Section Validator (10 tests)
- ✅ Phase 2: Content Validator (6 tests)
- ✅ Phase 3: Link Validator (9 tests)
- ✅ Phase 4: CLI Integration
- ✅ Phase 5: Documentation
- ✅ Phase 6: (Included in Phase 4 - batch validation support)
Test Coverage:
- 25 semantic validator tests: 100% passing
- Full test suite: 1303 passed, 3 skipped
- No regressions introduced
Files Created:
- markitect/validators/__init__.py (68 lines)
- markitect/validators/section_validator.py (213 lines)
- markitect/validators/content_validator.py (317 lines)
- markitect/validators/link_validator.py (507 lines)
- markitect/semantic_validator.py (262 lines)
- tests/test_semantic_validator.py (746 lines)
Files Modified:
- markitect/cli.py (lines 1493-1668) - Enhanced validate command
- docs/SCHEMA_MANAGEMENT_GUIDE.md - Comprehensive documentation
- CHANGELOG.md - Feature documentation
Commits:
- feat: add semantic document validator for x-markitect extensions (82c1a3a)
- feat: enhance validate command with semantic validation (da34303)
- docs: add semantic validation guide to schema management (d2cd2d2)
- docs: add semantic validation feature to CHANGELOG (0d78837)
- feat: add LinkValidator for semantic link validation (Phase 3) (20c0cfe)
- docs: update CHANGELOG with LinkValidator feature (689fb21)
Key Features Delivered
1. Section Classification Enforcement
   - REQUIRED/RECOMMENDED/OPTIONAL/DISCOURAGED/IMPROPER validation
   - Alternative section names support
   - Line number tracking for errors
2. Content Pattern Validation
   - Regex pattern matching (required/forbidden/discouraged)
   - Word count and sentence count validation
   - Quality metrics with configurable thresholds
3. Link Validation
   - Internal link validation (fragments and file paths) - default enabled
   - External link validation (HTTP/HTTPS) - opt-in with --check-links
   - Email validation (mailto: format)
   - Comprehensive statistics tracking
4. CLI Integration
   - --semantic/--no-semantic flag (default: true)
   - --check-links flag for external link validation
   - --strict flag to treat warnings as errors
   - Combined structural + semantic reporting
5. Comprehensive Documentation
   - Complete user guide with examples
   - 5 common validation scenarios
   - Integration with existing schema management guide
Performance Characteristics
- Fast by default: Internal link checking only (no network calls)
- Opt-in slow operations: External link validation with --check-links
- Scalable: Modular architecture allows selective validation
- CI/CD ready: Exit codes, strict mode, batch support
Success Metrics Achieved
✅ Can validate documents against all 4 production schemas
✅ Required/improper sections properly enforced
✅ Content patterns validated with regex
✅ Link validation with internal/external support
✅ >90% test coverage for validator modules
✅ Complete documentation with examples for each schema type
Topic Status: CLOSED - Moved to history on 260106 (2026-01-06)