# Markdown Dataflow Workflow Assessment

Date: 2026-05-04

## Question

Can markitect-tool support workflows that grab data from one or more Markdown documents, process it deterministically or with optional LLM assistance, and inject the result into one or more Markdown outputs?

## Short Answer

Partially today, but not yet as a clean framework.

The current implementation provides the right primitives:

- parse Markdown and frontmatter
- query/extract structured content
- transform documents
- compose files
- include/transclude content
- render deterministic templates
- generate stubs from contracts
- run simple generation plans
- expose a provider-neutral assisted-generation hook

However, the user still has to orchestrate these steps manually through shell commands or Python code. There is no declarative pipeline model that says:

sources -> extracted data products -> deterministic processors -> assisted
processors -> templates/generation -> multiple outputs

That missing layer is where markitect can become much more practical.
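
As a rough illustration of the stage order above, the following Python sketch runs sources through extraction, a deterministic processor, and templates into several outputs. It is not markitect-tool code: `extract_section` and `run_pipeline` are hypothetical names, the heading matcher is deliberately naive (it would misread `#` lines inside code fences), and `str.format` stands in for a real template renderer.

```python
def extract_section(markdown: str, heading: str) -> str:
    """Return the text under the first heading titled `heading` (hypothetical helper)."""
    body, capturing = [], False
    for line in markdown.splitlines():
        if line.startswith("#"):
            if capturing:
                break  # the next heading ends the captured section
            capturing = line.lstrip("#").strip() == heading
            continue
        if capturing:
            body.append(line)
    return "\n".join(body).strip()


def run_pipeline(documents: dict[str, str], templates: dict[str, str]) -> dict[str, str]:
    """sources -> extracted data -> deterministic processing -> multiple outputs."""
    # 1. Sources -> extracted data products, visited in a stable (sorted) order.
    decisions = [extract_section(text, "Decision") for _, text in sorted(documents.items())]
    # 2. Deterministic processor: drop documents that had no Decision section.
    decisions = [d for d in decisions if d]
    # 3. Templates -> one rendered string per named output.
    data = {"decisions": "\n".join(f"- {d}" for d in decisions), "count": len(decisions)}
    return {name: template.format(**data) for name, template in templates.items()}


if __name__ == "__main__":
    docs = {
        "adr-2.md": "# ADR 2\n\n## Decision\n\nUse YAML.\n",
        "adr-1.md": "# ADR 1\n\n## Decision\n\nKeep core small.\n\n## Status\n\naccepted\n",
    }
    outputs = run_pipeline(docs, {"notes": "Decisions ({count}):\n{decisions}"})
    print(outputs["notes"])
```

Everything here is in-memory on purpose; reading globs and writing files would sit at the two ends of the same function.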

## Comparison with markitect-main

markitect-main contained a larger set of separate experiments:

- template parser and renderer
- data-driven draft generation
- prompt/LLM quality gates
- transclusion with variables and conditionals
- batch/document-processing commands
- infospace and spaces workflows
- cache/reference graph ideas

The new implementation is better in its core shape:

- smaller, provider-neutral modules
- deterministic behavior before optional LLM use
- CLI/API parity
- contracts as a stronger rule source than raw schemas
- structured query/extract feeding generation
- explicit safety boundaries for includes
- tests around each primitive

What we sacrificed:

- no first-class batch/pipeline runner yet
- no prompt/LLM workflow execution in core
- no variable/conditional transclusion yet
- no data-driven multi-record draft generator yet
- no workflow provenance graph tying inputs to outputs
- no multi-output orchestration
- no built-in object/data shaping between extraction and rendering

This is a good trade for the foundation, but the pipeline layer needs to exist.

## Desired Workflow Shape

A future pipeline plan should be Markdown-native and inspectable:

# Release Note Pipeline

```yaml workflow
sources:
  decisions:
    glob: docs/adr/*.md
    extract:
      accepted:
        selector: sections[heading=Decision]
      status:
        selector: frontmatter.status

steps:
  summarize:
    kind: deterministic.template
    template: templates/release-summary.md
    data:
      decisions: ${sources.decisions.accepted}

  assisted_review:
    kind: assisted.generation
    input: ${steps.summarize.markdown}
    prompt: prompts/reviewer.md
    optional: true

outputs:
  release_notes:
    template: templates/release-notes.md
    data:
      summary: ${steps.summarize.markdown}
      review: ${steps.assisted_review.markdown}
    output: out/release-notes.md
```

This should remain executable without LLM support. Assisted steps should be
optional, externally supplied, and policy-aware.
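
One way to read "optional, externally supplied, and policy-aware" is sketched below in Python. This is an assumption about how such a step could behave, not an existing markitect-tool API: `run_assisted_step`, the `provider` callable, and the `allow` policy hook are all hypothetical names. The point is that a missing provider or a policy refusal turns an optional step into an empty result with a recorded status, so the rest of the plan still runs.

```python
from typing import Callable, Optional, Tuple

# Hypothetical provider signature: (prompt, input_markdown) -> markdown.
AssistFn = Callable[[str, str], str]


def run_assisted_step(
    input_markdown: str,
    prompt: str,
    provider: Optional[AssistFn] = None,
    optional: bool = True,
    allow: Callable[[str], bool] = lambda text: True,
) -> Tuple[str, str]:
    """Return (markdown, status); optional steps degrade to an empty result."""
    if provider is None:
        if optional:
            return "", "skipped: no provider"
        raise RuntimeError("assisted step is required but no provider was supplied")
    # The policy hook runs before any content leaves the deterministic core.
    if not allow(input_markdown):
        if optional:
            return "", "skipped: blocked by policy"
        raise PermissionError("policy refused the assisted step input")
    return provider(prompt, input_markdown), "ok"


if __name__ == "__main__":
    # No provider configured: the pipeline keeps going with an empty review.
    print(run_assisted_step("# Summary", "Review this."))
```

Returning the status alongside the Markdown keeps the skip visible in diagnostics instead of silently producing an empty section.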

## Architecture Gap

The missing generalized layer needs:

- source collectors for Markdown files, globs, directories, and future indexes
- named extracted data products
- a small data expression model for referencing previous results
- deterministic step registry
- optional assisted step registry
- multi-output sinks
- provenance and diagnostics per step
- dry-run/plan/inspect modes
- caching and invalidation hooks
- policy hooks before assisted steps or sensitive output writes
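
The "small data expression model" and the plan/inspect mode in the list above could look roughly like this Python sketch. The `${...}` syntax is taken from the example plan earlier in this document; the function names and the rule that `${steps.<name>...}` implies a dependency on that step are assumptions made for illustration. Ordering uses `graphlib.TopologicalSorter` from the standard library (Python 3.9+), which raises `CycleError` if steps reference each other in a loop.

```python
import re
from graphlib import TopologicalSorter

# Matches references such as ${steps.summarize.markdown}.
REF = re.compile(r"\$\{([A-Za-z0-9_.]+)\}")


def resolve(expr: str, results: dict):
    """Resolve a whole-string ${a.b.c} reference against nested dicts."""
    match = REF.fullmatch(expr)
    if match is None:
        return expr  # not a reference: treat as a literal value
    value = results
    for key in match.group(1).split("."):
        value = value[key]
    return value


def find_refs(value):
    """Yield every reference path found in a nested step specification."""
    if isinstance(value, str):
        yield from REF.findall(value)
    elif isinstance(value, dict):
        for item in value.values():
            yield from find_refs(item)
    elif isinstance(value, list):
        for item in value:
            yield from find_refs(item)


def plan_order(steps: dict) -> list:
    """Dry-run helper: return step names in an order that satisfies references."""
    graph = {
        name: {ref.split(".")[1] for ref in find_refs(spec) if ref.startswith("steps.")}
        for name, spec in steps.items()
    }
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    steps = {
        "assisted_review": {"input": "${steps.summarize.markdown}"},
        "summarize": {"data": {"decisions": "${sources.decisions.accepted}"}},
    }
    print(plan_order(steps))
```

Because the order is derived from the plan alone, a dry-run mode can print it before any file is read or written.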

## Relationship to Existing Workplans

- `MKTT-WP-0003` gives the primitive surface.
- `MKTT-WP-0010` gives richer references, processors, regions, and chunks.
- `MKTT-WP-0006` gives backend/provenance/cache interfaces.
- `MKTT-WP-0005` gives runtime context and form/assessment engines.
- `MKTT-WP-0011` should become the declarative pipeline/workflow layer that
  wires these together.

## Recommendation

Do not squeeze this into P3.7. P3.7 should stay focused on lightweight caching
and incremental processing for the current primitives.

Create a new workplan for declarative Markdown dataflow pipelines. It should be
P1/P2: important enough not to forget, but best implemented after the reference
and processor model has at least its first architecture pass.