Parse Markdown files into a structured Python object

2026-05-03 21:37:00 +02:00
parent 2676994b11
commit 705f2c6178
15 changed files with 571 additions and 8 deletions
--- a/README.md
+++ b/README.md
@@ -20,3 +20,17 @@ requirements documents in `wiki/`.
 The repo is registered with the Custodian State Hub as `markitect-tool` under
 the `markitect` domain. See `docs/state-hub-integration.md`.
 ## Development
 Run the tests:
 ```bash
 python3 -m pytest
 ```
 Try the parser CLI from a checkout:
 ```bash
 PYTHONPATH=src python3 -m markitect_tool parse README.md --format tree
 ```
--- a/docs/packaging-decision.md
+++ b/docs/packaging-decision.md
@@ -0,0 +1,36 @@
 # Packaging Decision
 Date: 2026-05-03
 ## Decision
 `markitect-tool` starts as a Python 3.12+ package with:
 - Distribution name: `markitect-tool`
 - Import package: `markitect_tool`
 - CLI entry point: `mkt`
 - Build backend: `setuptools`
 - Test runner: `pytest`
 - Source layout: `src/markitect_tool`
 ## Initial Dependencies
 Core dependencies:
 - `markdown-it-py`
 - `PyYAML`
 - `click>=8.0`
 Optional extras:
 - `query`: `jsonpath-ng`
 - `tables`: `tabulate`
 - `llm`: `llm-connect`
 - `dev`: `pytest`
 ## Rationale
 This follows the WP-0002 dependency classification and keeps the first
 implementation focused on deterministic markdown parsing and CLI access. The
 package name avoids legacy `markitect.*` imports while the `mkt` entry point
 matches the PRD.
--- a/docs/state-hub-integration.md
+++ b/docs/state-hub-integration.md
@@ -32,8 +32,10 @@ workplans/
 ## Follow-Up
-Once implementation dependencies exist, add an SBOM source and update State Hub
-with the SBOM ingestion result. This seed repo currently has no package manifest.
+SBOM source: `sbom-tools.yaml`.
+
+Initial SBOM ingest succeeded on 2026-05-03 with seven declared entries for the
+core and optional dependencies.
 ## Registered Extension Points
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -0,0 +1,40 @@
 [build-system]
 requires = ["setuptools>=69"]
 build-backend = "setuptools.build_meta"
 [project]
 name = "markitect-tool"
 version = "0.1.0"
 description = "Markdown-native toolkit and CLI for structured knowledge artifacts"
 readme = "README.md"
 requires-python = ">=3.12"
 license = { text = "MIT" }
 dependencies = [
    "click>=8.0",
    "markdown-it-py",
    "PyYAML",
 ]
 [project.optional-dependencies]
 dev = [
    "pytest>=8",
 ]
 query = [
    "jsonpath-ng>=1.5",
 ]
 tables = [
    "tabulate>=0.9",
 ]
 llm = [
    "llm-connect @ file:///home/worsch/llm-connect",
 ]
 [project.scripts]
 mkt = "markitect_tool.cli:main"
 [tool.setuptools.packages.find]
 where = ["src"]
 [tool.pytest.ini_options]
 testpaths = ["tests"]
 pythonpath = ["src"]
--- a/sbom-tools.yaml
+++ b/sbom-tools.yaml
@@ -0,0 +1,29 @@
 tools:
  - name: click
    ecosystem: python
    is_direct: true
    is_dev: false
  - name: markdown-it-py
    ecosystem: python
    is_direct: true
    is_dev: false
  - name: PyYAML
    ecosystem: python
    is_direct: true
    is_dev: false
  - name: pytest
    ecosystem: python
    is_direct: true
    is_dev: true
  - name: jsonpath-ng
    ecosystem: python
    is_direct: true
    is_dev: false
  - name: tabulate
    ecosystem: python
    is_direct: true
    is_dev: false
  - name: llm-connect
    ecosystem: python
    is_direct: true
    is_dev: false
--- a/src/markitect_tool/__init__.py
+++ b/src/markitect_tool/__init__.py
@@ -0,0 +1,21 @@
 """Structured markdown primitives for markitect-tool."""
 from markitect_tool.core import (
    ContentBlock,
    Document,
    Heading,
    MarkdownParseError,
    Section,
    parse_markdown,
    parse_markdown_file,
 )
 __all__ = [
    "ContentBlock",
    "Document",
    "Heading",
    "MarkdownParseError",
    "Section",
    "parse_markdown",
    "parse_markdown_file",
 ]
--- a/src/markitect_tool/__main__.py
+++ b/src/markitect_tool/__main__.py
@@ -0,0 +1,6 @@
 """Run the `mkt` CLI with `python -m markitect_tool`."""
 from markitect_tool.cli import main
 main()
--- a/src/markitect_tool/cli/__init__.py
+++ b/src/markitect_tool/cli/__init__.py
@@ -0,0 +1,5 @@
 """Command-line interface for markitect-tool."""
 from markitect_tool.cli.main import main
 __all__ = ["main"]
--- a/src/markitect_tool/cli/main.py
+++ b/src/markitect_tool/cli/main.py
@@ -0,0 +1,44 @@
 """`mkt` command entry point."""
 from __future__ import annotations
 import json
 from pathlib import Path
 import click
 import yaml
 from markitect_tool.core import parse_markdown_file
@click.group()
@click.version_option()
 def main() -> None:
    """Markdown-native toolkit for structured knowledge artifacts."""
@main.command()
@click.argument("file", type=click.Path(exists=True, dir_okay=False, path_type=Path))
@click.option(
    "--format",
    "output_format",
    type=click.Choice(["json", "yaml", "tree"], case_sensitive=False),
    default="json",
    show_default=True,
 )
 def parse(file: Path, output_format: str) -> None:
    """Parse a Markdown file into a structured representation."""
    document = parse_markdown_file(file)
    data = document.to_dict()
    if output_format == "yaml":
        click.echo(yaml.safe_dump(data, sort_keys=False))
    elif output_format == "tree":
        for heading in document.headings:
            click.echo(f"{'#' * heading.level} {heading.text}")
    else:
        click.echo(json.dumps(data, indent=2, ensure_ascii=False))
 if __name__ == "__main__":
    main()
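
The `tree` format above prints one line per heading, repeating `#` once per heading level. A minimal sketch of that rendering in isolation, assuming headings arrive as `(level, text)` pairs; `render_tree` is a hypothetical helper used only for illustration and is not part of the commit:

```python
from __future__ import annotations


def render_tree(headings: list[tuple[int, str]]) -> str:
    """Render (level, text) pairs as Markdown-style outline lines.

    Hypothetical helper: the commit echoes each line directly with click
    instead of building a single string.
    """
    return "\n".join(f"{'#' * level} {text}" for level, text in headings)


outline = render_tree([(1, "One"), (2, "Two")])
print(outline)
```

Building one string rather than echoing per line is only a convenience for testing the sketch; the output lines match what `mkt parse --format tree` is expected to print for the same headings.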
--- a/src/markitect_tool/core/__init__.py
+++ b/src/markitect_tool/core/__init__.py
@@ -0,0 +1,14 @@
 """Core markdown parsing and document model."""
 from markitect_tool.core.document import ContentBlock, Document, Heading, Section
 from markitect_tool.core.parser import MarkdownParseError, parse_markdown, parse_markdown_file
 __all__ = [
    "ContentBlock",
    "Document",
    "Heading",
    "MarkdownParseError",
    "Section",
    "parse_markdown",
    "parse_markdown_file",
 ]
--- a/src/markitect_tool/core/document.py
+++ b/src/markitect_tool/core/document.py
@@ -0,0 +1,72 @@
 """Structured document model for parsed Markdown."""
 from __future__ import annotations
 from dataclasses import asdict, dataclass, field
 from typing import Any
@dataclass(frozen=True)
 class Heading:
    """A Markdown heading with source location."""
    level: int
    text: str
    line: int
    def to_dict(self) -> dict[str, Any]:
        return asdict(self)
@dataclass(frozen=True)
 class ContentBlock:
    """A top-level Markdown content block."""
    type: str
    text: str
    line_start: int | None = None
    line_end: int | None = None
    heading_level: int | None = None
    def to_dict(self) -> dict[str, Any]:
        data = asdict(self)
        return {key: value for key, value in data.items() if value is not None}
@dataclass(frozen=True)
 class Section:
    """A heading-led section."""
    heading: Heading
    blocks: list[ContentBlock] = field(default_factory=list)
    def to_dict(self) -> dict[str, Any]:
        return {
            "heading": self.heading.to_dict(),
            "blocks": [block.to_dict() for block in self.blocks],
        }
@dataclass(frozen=True)
 class Document:
    """Structured representation of a Markdown document."""
    source_path: str | None
    frontmatter: dict[str, Any]
    body: str
    blocks: list[ContentBlock]
    headings: list[Heading]
    sections: list[Section]
    tokens: list[dict[str, Any]]
    def to_dict(self) -> dict[str, Any]:
        data = {
            "source_path": self.source_path,
            "frontmatter": self.frontmatter,
            "body": self.body,
            "blocks": [block.to_dict() for block in self.blocks],
            "headings": [heading.to_dict() for heading in self.headings],
            "sections": [section.to_dict() for section in self.sections],
            "tokens": self.tokens,
        }
        return {key: value for key, value in data.items() if value is not None}
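
`ContentBlock.to_dict` and `Document.to_dict` above filter out keys whose value is `None`, so optional fields only appear in the serialized output when they are set. A reduced sketch of that pattern, using a hypothetical `Block` class with three fields as a stand-in for `ContentBlock`:

```python
from __future__ import annotations

from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Block:
    """Hypothetical stand-in for ContentBlock, reduced to three fields."""

    type: str
    text: str
    line_start: Optional[int] = None

    def to_dict(self) -> dict[str, Any]:
        # Drop unset optional fields so the serialized form stays compact.
        return {key: value for key, value in asdict(self).items() if value is not None}


with_line = Block(type="paragraph", text="Body", line_start=3).to_dict()
without_line = Block(type="paragraph", text="Body").to_dict()
print(with_line)
print(without_line)
```

One consequence worth keeping in mind: consumers of the JSON or YAML output should treat `line_start`, `line_end`, and `heading_level` as optional keys rather than expecting explicit nulls.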
--- a/src/markitect_tool/core/parser.py
+++ b/src/markitect_tool/core/parser.py
@@ -0,0 +1,182 @@
 """Markdown parsing into a stable structured representation."""
 from __future__ import annotations
 from pathlib import Path
 from typing import Any
 import yaml
 from markdown_it import MarkdownIt
 from markdown_it.token import Token
 from markitect_tool.core.document import ContentBlock, Document, Heading, Section
 class MarkdownParseError(ValueError):
    """Raised when Markdown metadata cannot be parsed safely."""
 def parse_markdown_file(path: str | Path) -> Document:
    """Parse a Markdown file into a structured document."""
    file_path = Path(path)
    text = file_path.read_text(encoding="utf-8")
    return parse_markdown(text, source_path=str(file_path))
 def parse_markdown(markdown: str, source_path: str | None = None) -> Document:
    """Parse Markdown text into frontmatter, blocks, headings, sections, and tokens."""
    frontmatter, body, body_line_offset = _split_frontmatter(markdown)
    tokens = _parse_tokens(body)
    blocks, headings = _blocks_and_headings(tokens, body_line_offset)
    sections = _sections_from_blocks(blocks, headings)
    return Document(
        source_path=source_path,
        frontmatter=frontmatter,
        body=body,
        blocks=blocks,
        headings=headings,
        sections=sections,
        tokens=tokens,
    )
 def _split_frontmatter(markdown: str) -> tuple[dict[str, Any], str, int]:
    if not markdown.startswith("---\n"):
        return {}, markdown, 0
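    # Search from index 3 so the newline that ends the opening fence can also
    # start the closing fence; otherwise an empty block ("---\n---\n") is missed.
    end = markdown.find("\n---", 3)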
    if end == -1:
        return {}, markdown, 0
    closing_end = markdown.find("\n", end + 4)
    if closing_end == -1:
        closing_end = len(markdown)
    else:
        closing_end += 1
    raw_frontmatter = markdown[4:end]
    body = markdown[closing_end:]
    try:
        data = yaml.safe_load(raw_frontmatter) if raw_frontmatter.strip() else {}
    except yaml.YAMLError as exc:
        raise MarkdownParseError(f"Invalid YAML frontmatter: {exc}") from exc
    if data is None:
        data = {}
    if not isinstance(data, dict):
        raise MarkdownParseError("Frontmatter must be a mapping")
    body_line_offset = markdown[:closing_end].count("\n")
    return data, body, body_line_offset
 def _parse_tokens(markdown: str) -> list[dict[str, Any]]:
    parser = MarkdownIt("commonmark", {"tables": True}).enable("table")
    return [_token_to_dict(token) for token in parser.parse(markdown)]
 def _token_to_dict(token: Token) -> dict[str, Any]:
    data = {
        "type": token.type,
        "tag": token.tag,
        "attrs": token.attrs,
        "map": token.map,
        "nesting": token.nesting,
        "level": token.level,
        "children": [_token_to_dict(child) for child in token.children]
        if token.children
        else None,
        "content": token.content,
        "markup": token.markup,
        "info": token.info,
        "meta": token.meta,
        "block": token.block,
        "hidden": token.hidden,
    }
    return {key: value for key, value in data.items() if value is not None}
 def _blocks_and_headings(
    tokens: list[dict[str, Any]], line_offset: int
 ) -> tuple[list[ContentBlock], list[Heading]]:
    blocks: list[ContentBlock] = []
    headings: list[Heading] = []
    for index, token in enumerate(tokens):
        token_type = token["type"]
        if token_type == "heading_open":
            inline = _next_inline(tokens, index)
            line_start, line_end = _line_range(token, line_offset)
            level = int(token.get("tag", "h1").lstrip("h") or "1")
            text = inline.get("content", "") if inline else ""
            heading = Heading(level=level, text=text, line=line_start or 1)
            headings.append(heading)
            blocks.append(
                ContentBlock(
                    type="heading",
                    text=text,
                    line_start=line_start,
                    line_end=line_end,
                    heading_level=level,
                )
            )
        elif token_type in {
            "paragraph_open",
            "bullet_list_open",
            "ordered_list_open",
            "blockquote_open",
            "fence",
            "code_block",
            "table_open",
        }:
            line_start, line_end = _line_range(token, line_offset)
            text = token.get("content", "")
            if not text and token_type.endswith("_open"):
                inline = _next_inline(tokens, index)
                text = inline.get("content", "") if inline else ""
            blocks.append(
                ContentBlock(
                    type=_block_type(token_type),
                    text=text,
                    line_start=line_start,
                    line_end=line_end,
                )
            )
    return blocks, headings
 def _next_inline(tokens: list[dict[str, Any]], index: int) -> dict[str, Any] | None:
    if index + 1 < len(tokens) and tokens[index + 1]["type"] == "inline":
        return tokens[index + 1]
    return None
 def _line_range(token: dict[str, Any], line_offset: int) -> tuple[int | None, int | None]:
    line_map = token.get("map")
    if not line_map:
        return None, None
    return line_map[0] + line_offset + 1, line_map[1] + line_offset
 def _block_type(token_type: str) -> str:
    return {
        "paragraph_open": "paragraph",
        "bullet_list_open": "bullet_list",
        "ordered_list_open": "ordered_list",
        "blockquote_open": "blockquote",
        "fence": "code",
        "code_block": "code",
        "table_open": "table",
    }.get(token_type, token_type)
 def _sections_from_blocks(
    blocks: list[ContentBlock], headings: list[Heading]
 ) -> list[Section]:
    sections: list[Section] = []
    current: Section | None = None
    heading_index = 0
    for block in blocks:
        if block.type == "heading":
            heading = headings[heading_index]
            heading_index += 1
            current = Section(heading=heading, blocks=[])
            sections.append(current)
        elif current is not None:
            current.blocks.append(block)
    return sections
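
The arithmetic in `_line_range` above is easiest to follow with concrete numbers. The sketch below restates it on plain lists, assuming (as that arithmetic implies) that a token's `map` is a zero-based `[start, end)` pair relative to the body, and that `line_offset` is the number of file lines consumed by frontmatter. `line_range` here is a hypothetical standalone copy, not an import from the package:

```python
from __future__ import annotations

from typing import Optional


def line_range(
    line_map: Optional[list[int]], line_offset: int
) -> tuple[Optional[int], Optional[int]]:
    """Convert a zero-based half-open [start, end) map to 1-based inclusive lines."""
    if line_map is None:
        return None, None
    start, end = line_map
    # +1 turns the zero-based start into a 1-based line number; the exclusive
    # end already equals the 1-based number of the last line it covers.
    return start + line_offset + 1, end + line_offset


# A heading on the second body line, after nine lines of frontmatter.
heading_lines = line_range([1, 2], 9)
# A three-line paragraph at the top of a file without frontmatter.
paragraph_lines = line_range([0, 3], 0)
print(heading_lines, paragraph_lines)
```

The first call corresponds to the frontmatter test in this commit, which expects the heading on file line 11.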
--- a/tests/test_parse_contract.py
+++ b/tests/test_parse_contract.py
@@ -0,0 +1,89 @@
 from pathlib import Path
 import pytest
 from click.testing import CliRunner
 from markitect_tool import MarkdownParseError, parse_markdown, parse_markdown_file
 from markitect_tool.cli import main
 def test_parse_markdown_preserves_headings_and_paragraphs():
    document = parse_markdown("# Heading\n\nThis is a paragraph.")
    assert document.frontmatter == {}
    assert document.headings[0].level == 1
    assert document.headings[0].text == "Heading"
    assert [block.type for block in document.blocks] == ["heading", "paragraph"]
    assert document.sections[0].heading.text == "Heading"
    assert document.sections[0].blocks[0].text == "This is a paragraph."
    assert document.tokens[0]["type"] == "heading_open"
 def test_parse_markdown_extracts_yaml_frontmatter():
    markdown = """---
 title: YAML Frontmatter Test Document
 tags:
  - yaml
  - frontmatter
 published: true
 nested:
  priority: high
 ---
 # YAML Frontmatter Test Document
 Body text.
 """
    document = parse_markdown(markdown)
    assert document.frontmatter["title"] == "YAML Frontmatter Test Document"
    assert document.frontmatter["tags"] == ["yaml", "frontmatter"]
    assert document.frontmatter["published"] is True
    assert document.frontmatter["nested"]["priority"] == "high"
    assert document.headings[0].line == 11
    assert document.body.lstrip().startswith("# YAML Frontmatter Test Document")
 def test_parse_markdown_without_frontmatter_is_graceful():
    document = parse_markdown("# Document Without Frontmatter\n\nText.")
    assert document.frontmatter == {}
    assert document.headings[0].text == "Document Without Frontmatter"
 def test_parse_markdown_rejects_non_mapping_frontmatter():
    with pytest.raises(MarkdownParseError, match="Frontmatter must be a mapping"):
        parse_markdown("---\n- nope\n---\n\n# Bad")
 def test_parse_markdown_file_records_source_path(tmp_path: Path):
    source = tmp_path / "doc.md"
    source.write_text("# Test Document\n\nBody", encoding="utf-8")
    document = parse_markdown_file(source)
    assert document.source_path == str(source)
    assert document.headings[0].text == "Test Document"
 def test_mkt_parse_outputs_json(tmp_path: Path):
    source = tmp_path / "doc.md"
    source.write_text("# Test Document\n\nBody", encoding="utf-8")
    result = CliRunner().invoke(main, ["parse", str(source)])
    assert result.exit_code == 0
    assert '"headings"' in result.output
    assert "Test Document" in result.output
 def test_mkt_parse_outputs_tree(tmp_path: Path):
    source = tmp_path / "doc.md"
    source.write_text("# One\n\n## Two\n", encoding="utf-8")
    result = CliRunner().invoke(main, ["parse", str(source), "--format", "tree"])
    assert result.exit_code == 0
    assert "# One" in result.output
    assert "## Two" in result.output
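
The frontmatter tests above cover a populated block and a non-mapping block. As a rough illustration of the delimiter handling they rely on, here is a line-based sketch of separating frontmatter from the body. It is not the commit's `_split_frontmatter`, which searches with `str.find`; `split_frontmatter_lines` is a hypothetical name, and YAML decoding is deliberately left out so the sketch needs only the standard library:

```python
from __future__ import annotations


def split_frontmatter_lines(markdown: str) -> tuple[str, str, int]:
    """Return (raw_frontmatter, body, body_line_offset).

    Hypothetical line-based variant for illustration only. A document
    without a complete `---` ... `---` block is returned unchanged.
    """
    lines = markdown.splitlines(keepends=True)
    if not lines or lines[0].rstrip("\r\n") != "---":
        return "", markdown, 0
    for index in range(1, len(lines)):
        if lines[index].rstrip("\r\n") == "---":
            raw = "".join(lines[1:index])
            body = "".join(lines[index + 1:])
            # index + 1 lines (both fences plus the metadata) precede the body.
            return raw, body, index + 1
    return "", markdown, 0


populated = split_frontmatter_lines("---\ntitle: x\n---\n# H\n")
empty_block = split_frontmatter_lines("---\n---\nBody")
unterminated = split_frontmatter_lines("---\ntitle: x\n")
print(populated, empty_block, unterminated)
```

Cases such as an empty block or a missing closing fence are natural candidates for additional contract tests alongside the ones above.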
--- a/workplans/MKTT-WP-0001-repo-foundation.md
+++ b/workplans/MKTT-WP-0001-repo-foundation.md
@@ -58,7 +58,7 @@ migration assessment, and implementation.
 ```task
 id: MKTT-WP-0001-T004
-status: todo
+status: done
 priority: medium
 state_hub_task_id: "c15f8492-93d0-43aa-ba12-0d4aaff97c03"
 ```
@@ -67,11 +67,13 @@ Choose package/module names, Python version target, dependency manager, and
 test runner. Keep the decision lightweight and aligned with the future `mkt`
 CLI entry point.
 Output: `docs/packaging-decision.md`.
 ## P1.5 - Add SBOM source once manifests exist
 ```task
 id: MKTT-WP-0001-T005
-status: blocked
+status: done
 priority: medium
 state_hub_task_id: "e77a5e46-aaa2-4717-922f-a871fa2fd1cc"
 ```
@@ -79,4 +81,4 @@ state_hub_task_id: "e77a5e46-aaa2-4717-922f-a871fa2fd1cc"
 After packaging files are introduced, generate or identify the SBOM source and
 update State Hub registration metadata.
-Blocked because the repository has no implementation package manifest yet.
+Output: `sbom-tools.yaml`; initial State Hub ingest succeeded on 2026-05-03.
--- a/workplans/MKTT-WP-0003-core-toolkit-implementation.md
+++ b/workplans/MKTT-WP-0003-core-toolkit-implementation.md
@@ -22,7 +22,7 @@ contract and the `mkt` CLI.
 ```task
 id: MKTT-WP-0003-T001
-status: todo
+status: done
 priority: high
 state_hub_task_id: "9d9501fe-6809-4b4f-bda6-ec5e5952ddc7"
 ```
@@ -30,11 +30,13 @@ state_hub_task_id: "9d9501fe-6809-4b4f-bda6-ec5e5952ddc7"
 Create project metadata, package layout, test structure, and a minimal CLI
 entry point that can be installed or run locally.
 Output: `pyproject.toml`, `src/markitect_tool/`, `tests/`.
 ## P3.2 - Implement structured markdown parse contract
 ```task
 id: MKTT-WP-0003-T002
-status: todo
+status: done
 priority: high
 state_hub_task_id: "7dead033-e249-46b0-9eb3-908ae231a987"
 ```
@@ -43,6 +45,9 @@ Implement FR-001 and FR-002: parse markdown files, preserve headings,
 frontmatter, sections, and content blocks, and expose structured output via
 API and CLI.
 Initial implementation complete for Markdown files, YAML frontmatter, headings,
 sections, content blocks, parser tokens, API access, and `mkt parse`.
 ## P3.3 - Implement schema load and validation
 ```task
@@ -108,10 +113,12 @@ Implement FR-070 and FR-071 after the parse/schema contracts are stable.
 ```task
 id: MKTT-WP-0003-T008
-status: todo
+status: done
 priority: medium
 state_hub_task_id: "94067c7e-e68b-45be-a1d6-90547eb15422"
 ```
 Resolve `TD-MKTT-001` by adding the implementation scaffold: package metadata,
 module layout, test runner, and `mkt` CLI entry point.
 Resolved by the initial package scaffold.