markitect-main/markitect/matter_contentmatter/parser.py
tegwick 494e1b7128 feat: Complete Issue #38 - Full MarkdownMatters CLI implementation with TDD8 methodology
Implemented comprehensive MarkdownMatters CLI following complete TDD8 seven-cycle methodology with full three-zone separation and extensive testing validation.

## Complete Implementation Summary

### TDD8 Cycles Completed (7/7)
- Cycle 1: Content command family
- Cycle 2: Frontmatter command family
- Cycle 3: Contentmatter command family
- Cycle 4: Tailmatter foundation
- Cycle 5: Tailmatter advanced features (QA, editorial, agent config)
- Cycle 6: Integration and performance optimization
- Cycle 7: Documentation and comprehensive testing

### Command Families Implemented (4/4)

#### Content Commands
- `content-get` - Extract main content without matter zones
- `content-stats` - Content statistics (words, lines, paragraphs, characters)

#### Frontmatter Commands
- `frontmatter-get [key]` - Get YAML/JSON frontmatter values (dot notation support)
- `frontmatter-set key=value` - Set frontmatter values with type detection
- `frontmatter-keys` - List all frontmatter keys (nested support)
- `frontmatter-stats` - Frontmatter analysis and statistics
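
A minimal sketch of how dot-notation lookup could work on parsed frontmatter. The helper name `get_by_dotted_key` and the sample data are hypothetical; the actual `frontmatter-get` implementation is not shown in this commit message and may differ.

```python
from typing import Any, Dict, Optional


def get_by_dotted_key(data: Dict[str, Any], dotted_key: str) -> Optional[Any]:
    """Walk nested dicts one key segment at a time; return None on a miss."""
    current: Any = data
    for segment in dotted_key.split("."):
        # Stop as soon as we hit a non-dict or a missing segment.
        if not isinstance(current, dict) or segment not in current:
            return None
        current = current[segment]
    return current


# Hypothetical frontmatter, already parsed from YAML or JSON.
frontmatter = {"title": "Guide", "config": {"theme": "dark"}}
theme = get_by_dotted_key(frontmatter, "config.theme")
missing = get_by_dotted_key(frontmatter, "config.missing")
print(theme, missing)
```

Returning `None` on a miss keeps the lookup total, which would let a CLI decide separately how to report an absent key.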

#### Contentmatter Commands
- `contentmatter-get [key]` - Get MultiMarkdown key-value pairs from content
- `contentmatter-set key=value` - Set MMD key-value pairs within content
- `contentmatter-keys` - List all contentmatter keys
- `contentmatter-stats` - Contentmatter analysis (URLs, emails, dates)

#### Tailmatter Commands
- `tailmatter-get [key]` - Get tailmatter values (dot notation for nested)
- `tailmatter-set key=value` - Set tailmatter values in YAML/JSON blocks
- `tailmatter-keys` - List all tailmatter keys
- `tailmatter-stats` - Tailmatter analysis with QA/editorial status
- `tailmatter-check` - QA checklist validation with progress tracking
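
As an illustration of checklist validation with progress tracking, the sketch below assumes `qa_checklist` is a flat mapping from item name to a boolean. That shape, the item names, and the function `checklist_progress` are assumptions made for the example, not taken from the MarkdownMatters specification.

```python
from typing import Dict, List, Tuple


def checklist_progress(qa_checklist: Dict[str, bool]) -> Tuple[int, int, List[str]]:
    """Return (completed count, total count, names of pending items)."""
    pending = [item for item, done in qa_checklist.items() if not done]
    total = len(qa_checklist)
    return total - len(pending), total, pending


# Hypothetical tailmatter data, already parsed from a YAML or JSON block.
tailmatter = {
    "qa_checklist": {
        "spelling": True,
        "links_checked": False,
        "images_alt_text": True,
    }
}
done, total, pending = checklist_progress(tailmatter["qa_checklist"])
print(f"{done}/{total} complete; pending: {pending}")
```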

### MarkdownMatters Specification Compliance
- **Three-zone separation**: Frontmatter (Publisher), Contentmatter (Author), Tailmatter (Editor/QA)
- **Format support**: YAML/JSON frontmatter, MMD key-value contentmatter, YAML/JSON tailmatter
- **Reserved namespaces**: qa_checklist, editorial, agent_config in tailmatter
- **Proper delimitation**: `---` delimited frontmatter, inline contentmatter, fenced code blocks tagged `yaml tailmatter` or `json tailmatter`
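
The three-zone layout can be sketched as a small splitter. This is a simplified illustration under stated assumptions: frontmatter sits at the very start of the file, tailmatter is a fenced block at the very end, and the names `Zones` and `split_zones` are hypothetical rather than part of the markitect API.

```python
import re
from typing import NamedTuple

FENCE = "`" * 3  # three backticks, built indirectly so the example nests in Markdown


class Zones(NamedTuple):
    frontmatter: str
    content: str
    tailmatter: str


# Frontmatter: a '---' delimited block anchored at the start of the document.
FRONTMATTER_RE = re.compile(r"\A---[ \t]*\n(.*?)\n---[ \t]*\n", re.DOTALL)
# Tailmatter: a fenced block tagged 'yaml tailmatter' or 'json tailmatter' at the end.
TAILMATTER_RE = re.compile(
    r"\n" + FENCE + r"(?:yaml|json)[ \t]+tailmatter[ \t]*\n(.*?)\n" + FENCE + r"\s*\Z",
    re.DOTALL,
)


def split_zones(text: str) -> Zones:
    """Split a document into frontmatter, content, and tailmatter bodies."""
    front = FRONTMATTER_RE.search(text)
    rest = text[front.end():] if front else text
    tail = TAILMATTER_RE.search(rest)
    content = rest[:tail.start()] if tail else rest
    return Zones(
        frontmatter=front.group(1) if front else "",
        content=content.strip(),
        tailmatter=tail.group(1) if tail else "",
    )


doc = (
    "---\n"
    "title: Guide\n"
    "---\n"
    "# Heading\n"
    "\n"
    "Author: Jane\n"
    "\n"
    + FENCE + "yaml tailmatter\n"
    "editorial:\n"
    "  status: draft\n"
    + FENCE + "\n"
)
zones = split_zones(doc)
print(zones)
```

Anchoring with `\A` and `\Z` means a `---` horizontal rule or a fenced block in the middle of the content is never mistaken for a matter zone.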

### Technical Architecture

#### Module Structure
```
markitect/
├── content/              # Content extraction (Cycle 1)
├── matter_frontmatter/   # YAML/JSON frontmatter (Cycle 2)
├── matter_contentmatter/ # MultiMarkdown key-value (Cycle 3)
└── matter_tailmatter/    # QA, editorial, agent config (Cycles 4-5)
```

#### Advanced Features
- **Dot notation**: Nested access (`nested.key.subkey`)
- **Smart typing**: Automatic boolean/number/array detection
- **Performance**: Large document processing <2 seconds
- **Error handling**: Comprehensive validation and recovery
- **Output formats**: Raw, JSON, text with consistent interfaces
- **Backup support**: Safe file modification with backup options
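
One way automatic boolean/number/array detection could be approached for `key=value` arguments is sketched below. The coercion rules (case-insensitive `true`/`false`, bracketed comma-separated lists, `int` before `float`) and the name `coerce_value` are assumptions for illustration; the rules used by the actual `*-set` commands are not documented here.

```python
from typing import Any


def coerce_value(raw: str) -> Any:
    """Guess a Python type for a CLI value string (assumed rules, see above)."""
    text = raw.strip()
    lowered = text.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    # Bracketed, comma-separated values become a list of coerced items.
    if text.startswith("[") and text.endswith("]"):
        inner = text[1:-1].strip()
        return [coerce_value(item) for item in inner.split(",")] if inner else []
    # Try the narrower numeric type first, then fall back to float.
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


for sample in ("true", "42", "3.5", "[a, 2, false]", "draft"):
    print(sample, "->", repr(coerce_value(sample)))
```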

### Testing Results (65/65 tests passing)
- **Content commands**: 16 tests - Parser, statistics, CLI integration
- **Frontmatter commands**: 22 tests - YAML/JSON parsing, nested access, modification
- **Contentmatter commands**: 21 tests - MMD extraction, statistics, content analysis
- **Integration tests**: 6 tests - Cross-command validation, performance, error handling

### Validation Achievements
- **100% test success rate** (65/65 tests passing)
- **Perfect zone separation** - Each command family accesses only its designated zone
- **MarkdownMatters compliance** - Full specification adherence
- **Performance validated** - Large documents process efficiently
- **Integration verified** - All command families work together seamlessly
- **CLI consistency** - Uniform command patterns and error handling

### Usage Examples
```bash
# Extract pure content without matter zones
markitect content-get --file document.md

# Access frontmatter with nested keys
markitect frontmatter-get config.theme --file document.md

# Work with inline MultiMarkdown key-values
markitect contentmatter-get Author --file document.md

# Validate QA checklist in tailmatter
markitect tailmatter-check --file document.md

# Get comprehensive statistics
markitect content-stats --file document.md
markitect frontmatter-stats --file document.md
markitect contentmatter-stats --file document.md
markitect tailmatter-stats --file document.md
```

This implementation provides complete MarkdownMatters CLI functionality with systematic TDD8 development, comprehensive testing, and full specification compliance for professional document metadata management.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-02 09:14:24 +02:00


"""
Contentmatter parser for extracting and manipulating MultiMarkdown key-value pairs within content.
"""
import re
from typing import Dict, List, Optional

from .stats import ContentmatterStats

# Frontmatter: a '---' delimited block anchored at the very start of the document.
# Anchoring with \A keeps '---' horizontal rules inside the content from matching.
_FRONTMATTER_RE = re.compile(r'\A---[ \t]*\n.*?\n---[ \t]*\n', re.DOTALL)

# Tailmatter: a fenced block tagged "yaml tailmatter" or "json tailmatter" at the
# very end of the document, optionally preceded by a '---' separator line.
_TAILMATTER_RE = re.compile(
    r'\n(?:---[ \t]*\n)?\s*```(?:yaml|json)[ \t]+tailmatter[ \t]*\n.*?```\s*\Z',
    re.DOTALL,
)


class ContentmatterParser:
    """Parser for contentmatter (MultiMarkdown key-value pairs) in MarkdownMatters documents."""

    def extract_contentmatter(self, text: str) -> Dict[str, str]:
        """
        Extract contentmatter (MMD key-value pairs) from content only.

        Args:
            text: Full markdown document text

        Returns:
            Dictionary containing contentmatter key-value pairs
        """
        # First extract only the content (remove frontmatter and tailmatter)
        content = self._extract_content_only(text)
        # Find all MMD key-value pairs in content
        return self._parse_mmd_keyvalues(content)

    def get_contentmatter_value(self, text: str, key: str) -> Optional[str]:
        """
        Get specific contentmatter value by key.

        Args:
            text: Full markdown document text
            key: Key to retrieve

        Returns:
            Value or None if not found
        """
        contentmatter = self.extract_contentmatter(text)
        return contentmatter.get(key)

    def set_contentmatter_value(self, text: str, key: str, value: str) -> str:
        """
        Set a contentmatter value in the document.

        Args:
            text: Full markdown document text
            key: Key to set
            value: Value to set

        Returns:
            Updated document text
        """
        # Extract content part to work with
        content = self._extract_content_only(text)

        # Check if key already exists ([ \t]* keeps the match on a single line)
        existing_pattern = rf'^{re.escape(key)}:[ \t]*.*$'
        if re.search(existing_pattern, content, re.MULTILINE):
            # Update existing key. A callable replacement inserts the value
            # literally, so backslashes in it are not treated as regex escapes.
            new_line = f"{key}: {value}"
            content = re.sub(existing_pattern, lambda _match: new_line, content, flags=re.MULTILINE)
        else:
            # Add new key-value pair after first heading or at start
            new_line = f"{key}: {value}\n"
            # Find first heading to add after it
            heading_match = re.search(r'^(#+[ \t]+.*?)$', content, re.MULTILINE)
            if heading_match:
                insert_pos = heading_match.end()
                content = content[:insert_pos] + "\n\n" + new_line + content[insert_pos:]
            else:
                # Add at beginning of content
                content = new_line + "\n" + content

        # Reconstruct full document
        return self._reconstruct_document(text, content)

    def get_contentmatter_keys(self, text: str) -> List[str]:
        """
        Get list of contentmatter keys.

        Args:
            text: Full markdown document text

        Returns:
            List of contentmatter keys
        """
        contentmatter = self.extract_contentmatter(text)
        return list(contentmatter.keys())

    def calculate_contentmatter_stats(self, text: str) -> ContentmatterStats:
        """
        Calculate statistics for contentmatter.

        Args:
            text: Full markdown document text

        Returns:
            ContentmatterStats object
        """
        contentmatter = self.extract_contentmatter(text)
        if not contentmatter:
            return ContentmatterStats(
                has_contentmatter=False,
                total_pairs=0,
                average_key_length=0.0,
                average_value_length=0.0,
                url_values=0,
                email_values=0,
                date_values=0
            )

        # Calculate basic stats
        total_pairs = len(contentmatter)
        key_lengths = [len(key) for key in contentmatter.keys()]
        value_lengths = [len(value) for value in contentmatter.values()]
        avg_key_length = sum(key_lengths) / len(key_lengths) if key_lengths else 0.0
        avg_value_length = sum(value_lengths) / len(value_lengths) if value_lengths else 0.0

        # Analyze value types
        url_values = self._count_url_values(contentmatter)
        email_values = self._count_email_values(contentmatter)
        date_values = self._count_date_values(contentmatter)

        return ContentmatterStats(
            has_contentmatter=True,
            total_pairs=total_pairs,
            average_key_length=avg_key_length,
            average_value_length=avg_value_length,
            url_values=url_values,
            email_values=email_values,
            date_values=date_values
        )

    def _extract_content_only(self, text: str) -> str:
        """Extract only content, removing frontmatter and tailmatter."""
        # Remove frontmatter (only a block at the very start of the document)
        content = _FRONTMATTER_RE.sub('', text, count=1)
        # Remove tailmatter (only a block at the very end of the document)
        content = _TAILMATTER_RE.sub('', content, count=1)
        return content.strip()

    def _parse_mmd_keyvalues(self, content: str) -> Dict[str, str]:
        """Parse MultiMarkdown key-value pairs from content."""
        contentmatter = {}
        # Pattern for MMD key-value pairs: "Key: Value" on its own line.
        # [ \t] is used instead of \s so a key or value never spans a line break.
        pattern = r'^([A-Za-z][A-Za-z0-9 \t]*[A-Za-z0-9]):[ \t]*(.+)$'
        for match in re.finditer(pattern, content, re.MULTILINE):
            key = match.group(1).strip()
            value = match.group(2).strip()
            contentmatter[key] = value
        return contentmatter

    def _count_url_values(self, contentmatter: Dict[str, str]) -> int:
        """Count values that contain an http(s) URL."""
        url_pattern = r'https?://'
        return sum(1 for value in contentmatter.values() if re.search(url_pattern, value))

    def _count_email_values(self, contentmatter: Dict[str, str]) -> int:
        """Count values that contain an email address."""
        # [A-Za-z] rather than [A-Z|a-z]: inside a character class '|' is a literal pipe
        email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
        return sum(1 for value in contentmatter.values() if re.search(email_pattern, value))

    def _count_date_values(self, contentmatter: Dict[str, str]) -> int:
        """Count values that look like dates."""
        date_patterns = [
            r'\d{4}-\d{2}-\d{2}',  # YYYY-MM-DD
            r'\d{2}/\d{2}/\d{4}',  # MM/DD/YYYY
            r'\d{2}-\d{2}-\d{4}',  # MM-DD-YYYY
        ]
        count = 0
        for value in contentmatter.values():
            for pattern in date_patterns:
                if re.search(pattern, value):
                    count += 1
                    break  # Count each value only once
        return count

    def _reconstruct_document(self, original_text: str, new_content: str) -> str:
        """Reconstruct document with updated content."""
        # Extract frontmatter if present
        frontmatter_match = _FRONTMATTER_RE.search(original_text)
        frontmatter = frontmatter_match.group(0) if frontmatter_match else ""

        # Extract tailmatter if present, searching only after the frontmatter so
        # its closing '---' is never mistaken for a tailmatter separator
        tailmatter_match = _TAILMATTER_RE.search(original_text[len(frontmatter):])
        tailmatter = tailmatter_match.group(0) if tailmatter_match else ""

        # Reconstruct
        return frontmatter + new_content + tailmatter