markitect-main/history/260106-semantic-document-validation/WORKPLAN.md
tegwick fc828a345b
docs: standardize on yymmdd- timestamp prefix format
Naming Convention Updates:
- Renamed history/2026-01-06-semantic-document-validation → history/260106-semantic-document-validation
- Documented yymmdd- format convention in history/README.md and roadmap/README.md
- Updated all date references in WORKPLAN.md and DONE.md
- Fixed SCHEMA_MANAGEMENT_GUIDE.md references to use yymmdd- format

Convention Details:
- Format: yymmdd-topic-name (e.g., 260106-semantic-document-validation)
- Benefits: Concise while maintaining chronological sorting
- Examples documented in both README files
- Applies to both roadmap/ and history/ directories

This establishes a consistent timestamp prefix convention that Claude and its agents should follow.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-06 03:57:42 +01:00


Plan: Schema System Enhancement - Semantic Document Validation

Overview

The schema management system has complete schema structure analysis tools (schema-analyze, schema-refine) and structural AST validation (markitect validate), but is missing semantic validation capabilities. This plan enhances validation to check sections, content patterns, and quality metrics defined in x-markitect extensions.

Current State Assessment

Already Implemented

  • schema-analyze: Detects rigid constraints, calculates rigidity score (markitect/schema_analyzer.py)
  • schema-refine: Automatically loosens rigid constraints (markitect/schema_refiner.py)
  • markitect validate: Validates AST structure against JSON schemas (cli.py:1493-1600)
    • Checks headings, paragraphs, code_blocks counts match schema
    • Validates document structure against JSON Schema properties
    • Does NOT check x-markitect-sections classifications
    • Does NOT validate x-markitect-content-control patterns
  • X-Markitect Extensions: Full system with sections, content-control, metadata
  • Metaschema Validation: Validates schema structure and extensions
  • 4 Production Schemas: manpage, API docs, terminology, schema-schema
  • Comprehensive Documentation: User guides, specifications, tests (97 tests passing)

Missing Capabilities (Semantic Validation)

  1. Section Classification Enforcement: required/recommended/optional/discouraged/improper not checked
  2. Content Pattern Validation: required_patterns, forbidden_patterns not matched
  3. Quality Metrics Validation: min_words, max_words, min_sentences not enforced
  4. Link Validation: Internal/external link checking not implemented
  5. Content Instructions: content_instruction fields defined but not validated
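For orientation, the extension data these checks would read can be pictured as a plain dictionary. The sketch below is an assumption assembled from the keys referenced later in this plan (classification, required_patterns, forbidden_patterns, content_quality, link_validation). The section names and pattern values are illustrative and not copied from a production schema, and `sections_with` is a hypothetical helper.

```python
# Illustrative only: a hand-written schema fragment using the x-markitect keys
# referenced in this plan. Section names and values are assumptions.
EXAMPLE_SCHEMA = {
    "x-markitect-sections": {
        "SYNOPSIS": {"classification": "required",
                     "error_message": "SYNOPSIS is required"},
        "EXAMPLES": {"classification": "recommended",
                     "warning_if_missing": "EXAMPLES is recommended"},
        "INTERNAL_NOTES": {"classification": "improper"},
    },
    "x-markitect-content-control": {
        "synopsis": {
            "required_patterns": [r"\*\*[\w-]+\*\*"],  # bold command name
            "forbidden_patterns": [r"\bTODO\b"],
        },
        "description": {"content_quality": {"min_words": 50}},
        "link_validation": {"check_internal": True, "check_external": False},
    },
}


def sections_with(schema: dict, classification: str) -> list[str]:
    """Return the section names carrying the given classification."""
    sections = schema.get("x-markitect-sections", {})
    return sorted(name for name, spec in sections.items()
                  if spec.get("classification") == classification)


print(sections_with(EXAMPLE_SCHEMA, "required"))
print(sections_with(EXAMPLE_SCHEMA, "improper"))
```

Capabilities 1 to 4 above each correspond to reading one part of this structure and comparing it with the parsed document.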

What We Have vs What We Need

Current markitect validate (Structural):

markitect validate doc.md --schema schema.json
# ✅ Checks: headings.level_2 has 5-30 items
# ✅ Checks: paragraphs has 10-500 items
# ✅ Checks: code_blocks has 1-50 items
# ❌ Does NOT check: SYNOPSIS section present (required)
# ❌ Does NOT check: INTERNAL_NOTES absent (improper)
# ❌ Does NOT check: Synopsis contains bold command name
# ❌ Does NOT check: Description has min 50 words

Enhanced markitect validate (Structural + Semantic):

markitect validate doc.md --schema manpage-schema-v1.0.md
# ✅ Checks: AST structure (existing)
# ✅ NEW: SYNOPSIS section present (required)
# ✅ NEW: INTERNAL_NOTES not present (improper)
# ✅ NEW: Synopsis contains **command** pattern
# ✅ NEW: Description has 50+ words
# ✅ NEW: No forbidden TODO patterns

Implementation Plan

Phase 1: Core Semantic Validator

Goal: Create semantic validator to complement existing structural validation

New Module: markitect/semantic_validator.py

Key Components:

class SemanticValidator:
    """Validates markdown documents against x-markitect extensions.

    Complements existing SchemaValidator which handles structural AST validation.
    This validator checks semantic aspects defined in x-markitect-* extensions.
    """

    def __init__(self, schema_path: str):
        # Load schema (supports .md schemas with embedded JSON)
        self.schema = load_schema_with_extensions(schema_path)

        # Initialize sub-validators
        self.section_validator = SectionValidator(self.schema)
        self.content_validator = ContentValidator(self.schema)
        self.link_validator = LinkValidator(self.schema)

    def validate(self, document_path: str, check_links: bool = False) -> SemanticValidationReport:
        """Main semantic validation entry point."""
        doc = parse_markdown_document(document_path)

        results = {
            'sections': self.section_validator.check(doc),
            'content': self.content_validator.check(doc)
        }

        if check_links:
            results['links'] = self.link_validator.check(doc)

        return SemanticValidationReport(results)

Features:

  • Load schema from registry or filesystem
  • Parse markdown document into AST
  • Validate sections against x-markitect-sections classifications
  • Check content against x-markitect-content-control patterns
  • Validate links if enabled
  • Generate detailed report with line numbers
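As a rough illustration of the first feature, loading a .md schema with embedded JSON might look like the sketch below. The plan does not say how the JSON is embedded, so the "first json-fenced block" rule and the helper name `extract_embedded_schema` are assumptions. This is not the actual `load_schema_with_extensions` implementation.

```python
import json
import re

FENCE = "`" * 3  # built at runtime so this example can itself sit in a fenced block

# Matches the first fenced block tagged "json" and captures its body.
_JSON_BLOCK = re.compile(
    rf"^{FENCE}json[ \t]*\n(.*?)^{FENCE}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def extract_embedded_schema(markdown_text: str) -> dict:
    """Parse the first json-fenced block of a markdown schema file.

    Hypothetical helper: the "first json fence" rule is an assumption.
    """
    match = _JSON_BLOCK.search(markdown_text)
    if match is None:
        raise ValueError("no json-fenced block found in schema document")
    return json.loads(match.group(1))


sample = "\n".join([
    "# Manpage Schema",
    "",
    FENCE + "json",
    '{"x-markitect-sections": {"SYNOPSIS": {"classification": "required"}}}',
    FENCE,
])
schema = extract_embedded_schema(sample)
print(schema["x-markitect-sections"]["SYNOPSIS"]["classification"])
```

Raising `ValueError` for a schema document without a JSON block keeps the failure distinct from a document that merely fails validation.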

Phase 2: Section Presence Validator

New Module: markitect/validators/section_validator.py

Validation Rules:

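class SectionValidator:
    """Validates section presence and classification compliance."""

    def __init__(self, schema: dict):
        # Keep the loaded schema so check() can read x-markitect-sections
        self.schema = schema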

    def check(self, document: MarkdownDocument) -> SectionValidationResult:
        sections_spec = self.schema.get('x-markitect-sections', {})
        doc_sections = document.get_headings_by_level(2)

        issues = []

        # Check REQUIRED sections
        for section_name, spec in sections_spec.items():
            if spec['classification'] == 'required':
                if section_name not in doc_sections:
                    issues.append(SectionMissing(
                        section=section_name,
                        severity='ERROR',
                        message=spec.get('error_message', f'{section_name} is required')
                    ))

        # Check IMPROPER sections (must not exist)
        for section_name, spec in sections_spec.items():
            if spec['classification'] == 'improper':
                if section_name in doc_sections:
                    issues.append(SectionImproper(
                        section=section_name,
                        severity='ERROR',
                        message=spec.get('error_message', f'{section_name} must not appear')
                    ))

        # Check RECOMMENDED sections (warnings)
        for section_name, spec in sections_spec.items():
            if spec['classification'] == 'recommended':
                if section_name not in doc_sections:
                    issues.append(SectionMissing(
                        section=section_name,
                        severity='WARNING',
                        message=spec.get('warning_if_missing', f'{section_name} is recommended')
                    ))

        return SectionValidationResult(issues)

Section Classification Enforcement:

  • REQUIRED → ERROR if missing
  • RECOMMENDED → WARNING if missing
  • OPTIONAL → No check
  • DISCOURAGED → WARNING if present
  • IMPROPER → ERROR if present
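The five rules above can also be written as a lookup table, which covers the DISCOURAGED case in the same pass as the others. The following is a minimal, self-contained sketch under that assumption. `RULES`, `SectionIssue`, and `check_sections` are hypothetical names, and headings are modelled as a plain set of strings rather than the project's document type.

```python
from dataclasses import dataclass

# Classification -> (condition that triggers an issue, severity).
# "missing" fires when the section is absent, "present" when it appears.
RULES = {
    "required":    ("missing", "ERROR"),
    "recommended": ("missing", "WARNING"),
    "discouraged": ("present", "WARNING"),
    "improper":    ("present", "ERROR"),
    # "optional" has no entry: it is never checked.
}


@dataclass(frozen=True)
class SectionIssue:
    section: str
    severity: str
    problem: str  # "missing" or "present"


def check_sections(sections_spec: dict, doc_headings: set[str]) -> list[SectionIssue]:
    """Apply the classification table in a single pass over the spec."""
    issues = []
    for name, spec in sections_spec.items():
        rule = RULES.get(spec.get("classification"))
        if rule is None:
            continue  # optional or unknown classification
        trigger, severity = rule
        is_present = name in doc_headings
        # A "missing" rule fires when absent; a "present" rule fires when found.
        if (trigger == "missing") != is_present:
            issues.append(SectionIssue(name, severity, trigger))
    return issues


spec = {
    "SYNOPSIS": {"classification": "required"},
    "EXAMPLES": {"classification": "recommended"},
    "BUGS": {"classification": "optional"},
    "HISTORY": {"classification": "discouraged"},
    "INTERNAL_NOTES": {"classification": "improper"},
}
issues = check_sections(spec, {"SYNOPSIS", "HISTORY", "INTERNAL_NOTES"})
for issue in issues:
    print(issue.severity, issue.section, issue.problem)
```

A table keeps the severity policy in one place, so changing how a classification is treated does not require touching the loop.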

Phase 3: Content Pattern Validator

New Module: markitect/validators/content_validator.py

Pattern Matching:

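import re

class ContentValidator:
    """Validates content against x-markitect-content-control rules."""

    def __init__(self, schema: dict):
        # Keep the loaded schema so check() can read x-markitect-content-control
        self.schema = schema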

    def check(self, document: MarkdownDocument) -> ContentValidationResult:
        content_rules = self.schema.get('x-markitect-content-control', {})
        issues = []

        for section_key, rules in content_rules.items():
            section = document.get_section(section_key.upper())
            if not section:
                continue  # Section validator handles missing sections

            # Check required patterns
            for pattern in rules.get('required_patterns', []):
                if not re.search(pattern, section.content):
                    issues.append(PatternMissing(
                        section=section.name,
                        pattern=pattern,
                        severity='ERROR'
                    ))

            # Check forbidden patterns
            for pattern in rules.get('forbidden_patterns', []):
                # Keep the match object so the offending text can be reported
                match = re.search(pattern, section.content)
                if match:
                    issues.append(ForbiddenPattern(
                        section=section.name,
                        pattern=pattern,
                        severity='ERROR',
                        matched_text=match.group(0)
                    ))

            # Check content quality
            quality = rules.get('content_quality', {})
            word_count = len(section.content.split())

            if 'min_words' in quality and word_count < quality['min_words']:
                issues.append(ContentTooShort(
                    section=section.name,
                    actual=word_count,
                    required=quality['min_words'],
                    severity='WARNING'
                ))

            if 'max_words' in quality and word_count > quality['max_words']:
                issues.append(ContentTooLong(
                    section=section.name,
                    actual=word_count,
                    limit=quality['max_words'],
                    severity='WARNING'
                ))

        return ContentValidationResult(issues)

Content Rules Checked:

  • Required patterns (regex matches)
  • Discouraged patterns (warnings)
  • Forbidden patterns (errors)
  • Word count ranges (min/max)
  • Sentence counts (if specified)
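The rules listed above can be exercised on a plain string, without the document model. The sketch below is one possible shape, under stated assumptions: `discouraged_patterns` is a hypothetical key name (the plan names the rule but not the key), and sentences are counted naively by splitting on `.`, `!` and `?`.

```python
import re


def check_content(text: str, rules: dict) -> list[tuple[str, str]]:
    """Return (severity, message) pairs for one section's text.

    Assumptions: `discouraged_patterns` is a hypothetical key name, and
    sentences are counted naively by splitting on ., ! and ?.
    """
    issues = []
    for pattern in rules.get("required_patterns", []):
        if not re.search(pattern, text):
            issues.append(("ERROR", f"required pattern missing: {pattern}"))
    for pattern in rules.get("forbidden_patterns", []):
        match = re.search(pattern, text)
        if match:
            issues.append(("ERROR", f"forbidden pattern found: {match.group(0)}"))
    for pattern in rules.get("discouraged_patterns", []):
        match = re.search(pattern, text)
        if match:
            issues.append(("WARNING", f"discouraged pattern found: {match.group(0)}"))

    quality = rules.get("content_quality", {})
    words = len(text.split())
    sentences = len([s for s in re.split(r"[.!?]+", text) if s.strip()])
    if "min_words" in quality and words < quality["min_words"]:
        issues.append(("WARNING", f"too short: {words} words, minimum {quality['min_words']}"))
    if "max_words" in quality and words > quality["max_words"]:
        issues.append(("WARNING", f"too long: {words} words, limit {quality['max_words']}"))
    if "min_sentences" in quality and sentences < quality["min_sentences"]:
        issues.append(("WARNING", f"too few sentences: {sentences}, minimum {quality['min_sentences']}"))
    return issues


rules = {
    "required_patterns": [r"\*\*\w+\*\*"],
    "forbidden_patterns": [r"\bTODO\b"],
    "content_quality": {"min_words": 5, "min_sentences": 2},
}
for severity, message in check_content("**mycmd** [options] TODO", rules):
    print(severity, message)
```

Whitespace splitting counts markdown tokens such as `**mycmd**` as words. A real implementation may prefer to count words on rendered text instead.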

Phase 4: Link Validator

New Module: markitect/validators/link_validator.py

Link Checking:

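class LinkValidator:
    """Validates links according to x-markitect-content-control.link_validation."""

    def __init__(self, schema: dict):
        # Keep the loaded schema so check() can read link_validation settings
        self.schema = schema

    def check(self, document: MarkdownDocument) -> LinkValidationResult:
        link_config = self.schema.get('x-markitect-content-control', {}).get('link_validation', {})

        # Test for an empty config rather than any(values): a config such as
        # {'allow_fragments': False} has only falsy values but must still run.
        if not link_config:
            return LinkValidationResult([])  # No link validation configured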

        links = document.extract_links()
        issues = []

        for link in links:
            # Check internal links
            if link.is_internal() and link_config.get('check_internal', False):
                target = document.resolve_internal_link(link.target)
                if not target:
                    issues.append(BrokenInternalLink(
                        link=link.target,
                        line=link.line_number,
                        severity='ERROR'
                    ))

            # Check external links
            if link.is_external() and link_config.get('check_external', False):
                # HTTP HEAD request with timeout
                if not self._check_url_exists(link.target):
                    issues.append(BrokenExternalLink(
                        link=link.target,
                        line=link.line_number,
                        severity='WARNING'  # External links are warnings
                    ))

            # Check fragments
            if link.has_fragment() and not link_config.get('allow_fragments', True):
                issues.append(FragmentNotAllowed(
                    link=link.target,
                    line=link.line_number,
                    severity='WARNING'
                ))

        return LinkValidationResult(issues)

Link Types Validated:

  • Internal links (to other sections/documents)
  • External links (HTTP/HTTPS URLs)
  • Fragment identifiers (#section-name)
  • Email links (mailto:)
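To make the four link types concrete, the sketch below extracts inline markdown links from text and sorts each target into one of them. It uses only the standard library. The function names and the regex-based extraction are assumptions for illustration, since the plan itself relies on `document.extract_links()` from the project's parser.

```python
import re
from urllib.parse import urlparse

# Inline markdown links: [text](target). Reference-style links and images
# are ignored in this sketch.
_LINK = re.compile(r"(?<!!)\[[^\]]*\]\(([^)\s]+)\)")


def classify_link(target: str) -> str:
    """Return 'fragment', 'email', 'external', or 'internal' for a link target."""
    if target.startswith("#"):
        return "fragment"
    scheme = urlparse(target).scheme.lower()
    if scheme == "mailto":
        return "email"
    if scheme in ("http", "https"):
        return "external"
    return "internal"


def extract_links(markdown_text: str) -> list[tuple[int, str, str]]:
    """Return (line_number, target, kind) for every inline link."""
    found = []
    for line_number, line in enumerate(markdown_text.splitlines(), start=1):
        for match in _LINK.finditer(line):
            target = match.group(1)
            found.append((line_number, target, classify_link(target)))
    return found


sample = (
    "See [usage](#usage) and [the guide](docs/guide.md).\n"
    "Visit [the site](https://example.com) or mail [us](mailto:team@example.com)."
)
for line_number, target, kind in extract_links(sample):
    print(line_number, kind, target)
```

Only the "external" kind needs a network request, which is why it is the one placed behind an opt-in flag.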

Phase 5: CLI Integration

Enhance Existing Command: markitect validate (cli.py:1493-1600)

New Options to Add:

@cli.command('validate')
@click.argument('file_path', type=click.Path(exists=True, path_type=Path))
@click.option('--schema', '-s', type=click.Path(exists=True, path_type=Path),
              help='Path to JSON schema file')
@click.option('--schema-json', type=str,
              help='JSON schema provided as a string')
@click.option('--quiet', '-q', is_flag=True,
              help='Only output validation result (true/false)')
@click.option('--detailed-errors', '--errors', is_flag=True,
              help='Show detailed validation errors (Issue #8)')
@click.option('--error-format', type=click.Choice(['text', 'json', 'markdown']), default='text',
              help='Format for detailed error output')
# NEW OPTIONS:
@click.option('--semantic/--no-semantic', default=True,
              help='Enable/disable semantic validation (sections, patterns, quality)')
@click.option('--check-links', is_flag=True,
              help='Enable link validation (may be slow)')
@click.option('--strict', is_flag=True,
              help='Treat warnings as errors')
@pass_config
def validate(config, file_path, schema, schema_json, quiet, detailed_errors, error_format,
             semantic, check_links, strict):
    """
    Validate a markdown file against a JSON schema.

    ENHANCED: Now includes semantic validation of x-markitect extensions:
    - Section classifications (required, recommended, optional, discouraged, improper)
    - Content patterns (required_patterns, forbidden_patterns)
    - Quality metrics (min_words, max_words, min_sentences)
    - Link validation (internal/external)

    Examples:
        # Structural + semantic validation (default)
        markitect validate doc.md --schema manpage-schema-v1.0.md

        # Only structural validation (classic mode)
        markitect validate doc.md --schema schema.json --no-semantic

        # With link checking
        markitect validate doc.md --schema 1 --check-links

        # Strict mode (warnings become errors)
        markitect validate doc.md --schema manpage-schema-v1.0.md --strict
    """
    # Existing structural validation code...
    # (Keep all existing logic for SchemaValidator)

    # NEW: Add semantic validation if enabled and schema has x-markitect extensions
    if semantic and schema is not None:
        # --schema arrives as a Path; SemanticValidator takes a string path
        semantic_validator = SemanticValidator(str(schema))
        semantic_report = semantic_validator.validate(file_path, check_links=check_links)

        # Combine structural and semantic results
        combined_report = CombinedValidationReport(structural_result, semantic_report)

        # Output combined results
        if not quiet:
            click.echo(combined_report.format(error_format))

        # Exit codes
        if combined_report.has_errors():
            sys.exit(1)
        elif strict and combined_report.has_warnings():
            sys.exit(1)

Integration Strategy:

  1. Keep existing structural validation (SchemaValidator) unchanged
  2. Add new semantic validation layer on top
  3. Use --no-semantic flag to disable new validation (backward compatibility)
  4. Combine structural + semantic results in unified report
  5. Default to semantic=True for new markdown schemas with extensions

Output Format (text):

Validating: my-command.1.md
Schema: manpage-schema-v1.0.md (v1.0.0)

Section Validation:
  ✅ SYNOPSIS - Present (required)
  ✅ DESCRIPTION - Present (required)
  ⚠️  EXAMPLES - Missing (recommended)
  ❌ INTERNAL_NOTES - Must not appear (improper)

Content Validation:
  ✅ SYNOPSIS - Patterns matched
  ⚠️  DESCRIPTION - Too short (35 words, minimum 50)
  ❌ SYNOPSIS - Forbidden pattern found: "TODO"

Link Validation: (skipped - use --check-links)

Summary:
  Errors: 2
  Warnings: 2
  Status: FAILED ❌

Failed validations:
  Line 12: INTERNAL_NOTES section must not appear in published manpages
  Line 5: SYNOPSIS contains forbidden pattern "TODO"

Phase 6: Batch Document Validation

New Command: markitect validate-batch

@cli.command('validate-batch')
@click.argument('directory', type=click.Path(exists=True, file_okay=False))
@click.option('--schema', '-s', type=str, required=True)
@click.option('--pattern', default='*.md', help='File pattern to match')
@click.option('--strict', is_flag=True)
@click.option('--summary-only', is_flag=True, help='Show only summary table')
@pass_config
def validate_batch_cmd(config, directory, schema, pattern, strict, summary_only):
    """Validate multiple documents in a directory.

    Example:
        markitect validate-batch docs/manpages/ --schema manpage-schema-v1.0.md
    """
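    # Find all matching documents (sorted for stable output)
    docs = sorted(Path(directory).glob(pattern))

    # Load the schema once and reuse the validator for every document
    validator = SemanticValidator(schema)
    results = []
    for doc in docs:
        report = validator.validate(doc)
        results.append((doc.name, report))

    # Show summary table
    display_batch_results(results)

    # Fail the run if any document has errors (or warnings under --strict);
    # assumes the report exposes has_errors()/has_warnings() like the combined report
    if any(r.has_errors() or (strict and r.has_warnings()) for _, r in results):
        sys.exit(1)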
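The summary table itself is not specified in this plan. One possible rendering, assuming each result has already been reduced to a (filename, errors, warnings) tuple, is sketched below. `format_batch_summary` is a hypothetical helper and not the `display_batch_results` function named above.

```python
def format_batch_summary(results: list[tuple[str, int, int]]) -> str:
    """Render (filename, errors, warnings) rows as a fixed-width table."""
    # Size the first column to the longest name, but never narrower than its header.
    width = max([len("Document")] + [len(name) for name, _, _ in results])
    lines = [f"{'Document':<{width}}  Errors  Warnings  Status"]
    for name, errors, warnings in results:
        status = "FAILED" if errors else "OK"
        lines.append(f"{name:<{width}}  {errors:>6}  {warnings:>8}  {status}")
    failed = sum(1 for _, errors, _ in results if errors)
    lines.append(f"{len(results)} documents, {failed} failed")
    return "\n".join(lines)


print(format_batch_summary([("a.md", 0, 1), ("long-name.md", 2, 0)]))
```

Returning a string instead of printing directly keeps the formatter easy to unit test and leaves the choice of output stream to the CLI layer.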

Implementation Phases

Phase 1 (Core - 1 session)

  • SemanticValidator class
  • Basic section validation
  • CLI validate command
  • Simple text output format

Phase 2 (Content - 1 session)

  • ContentValidator with pattern matching
  • Word count validation
  • Quality metrics checking
  • Enhanced reporting

Phase 3 (Links - 1 session)

  • LinkValidator with internal link checking
  • Optional external link validation
  • Fragment validation
  • Performance optimization (caching)

Phase 4 (Polish - 1 session)

  • Batch validation support
  • JSON/table output formats
  • Integration tests
  • Documentation updates

Critical Files

New Files:

  • markitect/semantic_validator.py - Main semantic validator (complements existing SchemaValidator)
  • markitect/validators/section_validator.py - Section classification enforcement
  • markitect/validators/content_validator.py - Content pattern matching and quality
  • markitect/validators/link_validator.py - Link validation
  • markitect/validators/__init__.py - Validators package
  • tests/test_semantic_validator.py - Semantic validator tests
  • tests/validators/test_section_validator.py - Section validator tests
  • tests/validators/test_content_validator.py - Content validator tests
  • tests/validators/test_link_validator.py - Link validator tests

Modified Files:

  • markitect/cli.py (lines 1493-1600) - Enhance validate command with semantic validation
  • markitect/schema_loader.py - May need utility to extract x-markitect extensions
  • docs/SCHEMA_MANAGEMENT_GUIDE.md - Add semantic validation section
  • examples/manpages/README.md - Add validation examples
  • examples/terminology/README.md - Add validation examples

Reference Files (unchanged, used for integration):

  • markitect/validator.py - Existing SchemaValidator for structural validation
  • markitect/schema_analyzer.py - Reference for schema extension parsing

Design Decisions

1. Markdown Parsing

Decision: Use existing markdown parser from markitect core
Rationale: Already handles frontmatter, sections, AST generation

2. Link Validation

Decision: Internal links checked by default, external links opt-in
Rationale: External link checking is slow (network requests), internal is fast

3. Severity Levels

Decision: ERROR (required violations), WARNING (recommended violations), INFO (suggestions)
Rationale: Matches schema classification system semantics

4. Exit Codes

Decision: 0=success, 1=validation failed, 2=system error
Rationale: Standard CLI conventions for CI/CD integration
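A small sketch of how these three exit codes could be derived, including the effect of strict mode on warnings. The function names and the choice of `OSError`/`ValueError` as "system errors" are assumptions. `validate` is a hypothetical callable standing in for the real pipeline.

```python
import sys

EXIT_OK, EXIT_VALIDATION_FAILED, EXIT_SYSTEM_ERROR = 0, 1, 2


def exit_code_for(errors: int, warnings: int, strict: bool = False) -> int:
    """Map issue counts to exit codes: 0 = success, 1 = validation failed."""
    if errors > 0 or (strict and warnings > 0):
        return EXIT_VALIDATION_FAILED
    return EXIT_OK


def run(validate, strict: bool = False) -> int:
    """Call validate() -> (errors, warnings); report crashes as exit code 2.

    `validate` is a hypothetical callable standing in for the real pipeline.
    """
    try:
        errors, warnings = validate()
    except (OSError, ValueError) as exc:  # e.g. unreadable file, malformed schema
        print(f"system error: {exc}", file=sys.stderr)
        return EXIT_SYSTEM_ERROR
    return exit_code_for(errors, warnings, strict)


print(run(lambda: (0, 2)))               # warnings only, not strict
print(run(lambda: (0, 2), strict=True))  # warnings only, strict
```

Keeping code 2 separate from code 1 lets a CI job tell "the document is invalid" apart from "the validator could not run".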

5. Pattern Syntax

Decision: Use Python regex patterns directly
Rationale: Schemas already use regex strings, no need for new syntax
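One consequence of using raw Python regex strings is that a schema can contain a pattern that does not compile. A hedged sketch of compiling patterns once up front and collecting the bad ones follows. `compile_patterns` is a hypothetical helper and not part of the plan.

```python
import re


def compile_patterns(patterns: list[str]) -> tuple[list[re.Pattern], list[str]]:
    """Compile schema regex strings once; collect the ones that fail to compile."""
    compiled, problems = [], []
    for raw in patterns:
        try:
            compiled.append(re.compile(raw))
        except re.error as exc:
            problems.append(f"invalid pattern {raw!r}: {exc}")
    return compiled, problems


# The second pattern starts with a bare quantifier, which `re` rejects.
compiled, problems = compile_patterns([r"\bTODO\b", r"**bold**"])
print(len(compiled), len(problems))
```

Reporting invalid patterns as schema problems, separately from document issues, avoids blaming a document for a mistake in its schema.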

Testing Strategy

Unit Tests

  • SectionValidator: Test all classification types
  • ContentValidator: Test pattern matching, word counts
  • LinkValidator: Test internal/external link checking
  • ValidationReport: Test formatting and aggregation

Integration Tests

  • Validate real manpage documents against manpage schema
  • Validate terminology documents against terminology schema
  • Test batch validation across multiple documents
  • Test CLI output formats

Edge Cases

  • Documents with no schema sections defined
  • Schemas with no content-control rules
  • Empty documents
  • Documents with malformed links
  • Unicode in patterns and content

User Workflows

Workflow 1: Validate Single Document

# Validate a manpage
markitect validate my-command.1.md --schema manpage-schema-v1.0.md

# With link checking
markitect validate my-command.1.md --schema 1 --check-links

Workflow 2: CI/CD Integration

#!/bin/bash
# Validate all manpages in CI
if ! markitect validate-batch docs/man/ --schema 1 --strict; then
    echo "Manpage validation failed!"
    exit 1
fi

Workflow 3: Pre-commit Hook

# .git/hooks/pre-commit
files=$(git diff --cached --name-only --diff-filter=ACM | grep '\.1\.md$')
for file in $files; do
    if ! markitect validate "$file" --schema manpage-schema-v1.0.md; then
        echo "Fix validation errors before committing"
        exit 1
    fi
done

Workflow 4: Interactive Editing

# Validate while editing
watch -n 2 'markitect validate draft.md --schema api-documentation-schema-v1.0.md'

Success Metrics

  1. Core Functionality: Can validate documents against all 4 production schemas
  2. Classification Enforcement: Required/improper sections properly checked
  3. Pattern Matching: Content patterns validated with regex
  4. Performance: Validate 100 documents in < 5 seconds (without link checking)
  5. Test Coverage: > 90% coverage for new validator modules
  6. Documentation: Complete examples for each schema type

Future Enhancements (Out of Scope)

  • Auto-fixing document validation errors
  • Suggestion engine for missing content
  • Readability scoring with specific algorithms
  • Image validation (size, format, accessibility)
  • Schema evolution analysis (breaking changes between versions)
  • Document-to-schema generation (inverse of current flow)

COMPLETION SUMMARY

Date Completed: 260106 (2026-01-06)
Status: All 6 phases completed successfully

Implementation Results

Phases Completed:

  1. Phase 1: Core Semantic Validator & Section Validator (10 tests)
  2. Phase 2: Content Validator (6 tests)
  3. Phase 3: Link Validator (9 tests)
  4. Phase 4: CLI Integration
  5. Phase 5: Documentation
  6. Phase 6: (Included in Phase 4 - batch validation support)

Test Coverage:

  • 25 semantic validator tests: 100% passing
  • Full test suite: 1303 passed, 3 skipped
  • No regressions introduced

Files Created:

  • markitect/validators/__init__.py (68 lines)
  • markitect/validators/section_validator.py (213 lines)
  • markitect/validators/content_validator.py (317 lines)
  • markitect/validators/link_validator.py (507 lines)
  • markitect/semantic_validator.py (262 lines)
  • tests/test_semantic_validator.py (746 lines)

Files Modified:

  • markitect/cli.py (lines 1493-1668) - Enhanced validate command
  • docs/SCHEMA_MANAGEMENT_GUIDE.md - Comprehensive documentation
  • CHANGELOG.md - Feature documentation

Commits:

  1. feat: add semantic document validator for x-markitect extensions (82c1a3a)
  2. feat: enhance validate command with semantic validation (da34303)
  3. docs: add semantic validation guide to schema management (d2cd2d2)
  4. docs: add semantic validation feature to CHANGELOG (0d78837)
  5. feat: add LinkValidator for semantic link validation (Phase 3) (20c0cfe)
  6. docs: update CHANGELOG with LinkValidator feature (689fb21)

Key Features Delivered

  1. Section Classification Enforcement

    • REQUIRED/RECOMMENDED/OPTIONAL/DISCOURAGED/IMPROPER validation
    • Alternative section names support
    • Line number tracking for errors
  2. Content Pattern Validation

    • Regex pattern matching (required/forbidden/discouraged)
    • Word count and sentence count validation
    • Quality metrics with configurable thresholds
  3. Link Validation

    • Internal link validation (fragments and file paths) - default enabled
    • External link validation (HTTP/HTTPS) - opt-in with --check-links
    • Email validation (mailto: format)
    • Comprehensive statistics tracking
  4. CLI Integration

    • --semantic/--no-semantic flag (default: true)
    • --check-links flag for external link validation
    • --strict flag to treat warnings as errors
    • Combined structural + semantic reporting
  5. Comprehensive Documentation

    • Complete user guide with examples
    • 5 common validation scenarios
    • Integration with existing schema management guide

Performance Characteristics

  • Fast by default: Internal link checking only (no network calls)
  • Opt-in slow operations: External link validation with --check-links
  • Scalable: Modular architecture allows selective validation
  • CI/CD ready: Exit codes, strict mode, batch support

Success Metrics Achieved

  • Can validate documents against all 4 production schemas
  • Required/improper sections properly enforced
  • Content patterns validated with regex
  • Link validation with internal/external support
  • >90% test coverage for validator modules
  • Complete documentation with examples for each schema type

Topic Status: CLOSED - Moved to history on 260106 (2026-01-06)