markitect-tool/docs/markdown-dataflow-workflow-assessment.md

# Markdown Dataflow Workflow Assessment

Date: 2026-05-04

## Question

Can `markitect-tool` support workflows that grab data from one or more Markdown
documents, process it deterministically or with optional LLM assistance, and
inject the result into one or more Markdown outputs?

## Short Answer

Partially today, but not yet as a clean framework.

The current implementation provides the right primitives:

- parse Markdown and frontmatter
- query/extract structured content
- transform documents
- compose files
- include/transclude content
- render deterministic templates
- generate stubs from contracts
- run simple generation plans
- expose a provider-neutral assisted-generation hook

However, the user still has to orchestrate these steps manually through shell
commands or Python code. There is no declarative pipeline model that says:

```text
sources -> extracted data products -> deterministic processors -> assisted
processors -> templates/generation -> multiple outputs
```

That missing layer is what would make `markitect-tool` practical for
multi-document workflows.
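
To illustrate what hand-written orchestration of these stages looks like, here is a minimal Python sketch that uses only the standard library. The helper names `extract_section` and `run_pipeline` are hypothetical and are not part of the `markitect-tool` API. The heading scan is deliberately naive and assumes no `#` lines inside code fences.

```python
import re
from string import Template


def extract_section(markdown: str, heading: str) -> str:
    """Return the text under the first heading titled `heading`.

    Hypothetical helper: stops at the next heading of any level.
    """
    body: list[str] = []
    capturing = False
    for line in markdown.splitlines():
        match = re.match(r"^(#+)\s+(.*)$", line)
        if match:
            if capturing:
                break  # the next heading ends the section
            capturing = match.group(2).strip() == heading
            continue
        if capturing:
            body.append(line)
    return "\n".join(body).strip()


def run_pipeline(sources: dict[str, str], template: str) -> str:
    # sources -> extracted data products
    decisions = [extract_section(text, "Decision") for text in sources.values()]
    # deterministic processor: drop empty results and sort for stable output
    bullets = "\n".join(f"- {d}" for d in sorted(d for d in decisions if d))
    # template/generation -> output text
    return Template(template).substitute(decisions=bullets)


docs = {
    "adr-2.md": "# ADR 2\n\n## Decision\n\nUse YAML plans.\n",
    "adr-1.md": "# ADR 1\n\n## Context\n\nBackground.\n\n## Decision\n\nAdopt contracts.\n",
}
result = run_pipeline(docs, "# Summary\n\n$decisions\n")
print(result)
```

Every stage here is wired by hand in `run_pipeline`; a declarative plan would move that wiring into data that can be inspected before anything runs.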

## Comparison with markitect-main

`markitect-main` contained a broader set of separate experiments:

- template parser and renderer
- data-driven draft generation
- prompt/LLM quality gates
- transclusion with variables and conditionals
- batch/document-processing commands
- infospace and spaces workflows
- cache/reference graph ideas

The new implementation is better in its core shape:

- smaller, provider-neutral modules
- deterministic behavior before optional LLM use
- CLI/API parity
- contracts as a stronger rule source than raw schemas
- structured query/extract feeding generation
- explicit safety boundaries for includes
- tests around each primitive

What it currently lacks as a result:

- no first-class batch/pipeline runner yet
- no prompt/LLM workflow execution in core
- no variable/conditional transclusion yet
- no data-driven multi-record draft generator yet
- no workflow provenance graph tying inputs to outputs
- no multi-output orchestration
- no built-in object/data shaping between extraction and rendering

This is a good trade for the foundation, but the pipeline layer still needs to
be built.

## Desired Workflow Shape

A future pipeline plan should be Markdown-native and inspectable:

````markdown
# Release Note Pipeline

```yaml workflow
sources:
  decisions:
    glob: docs/adr/*.md
    extract:
      accepted:
        selector: sections[heading=Decision]
      status:
        selector: frontmatter.status

steps:
  summarize:
    kind: deterministic.template
    template: templates/release-summary.md
    data:
      decisions: ${sources.decisions.accepted}

  assisted_review:
    kind: assisted.generation
    input: ${steps.summarize.markdown}
    prompt: prompts/reviewer.md
    optional: true

outputs:
  release_notes:
    template: templates/release-notes.md
    data:
      summary: ${steps.summarize.markdown}
      review: ${steps.assisted_review.markdown}
    output: out/release-notes.md
```
````

This should remain executable without LLM support. Assisted steps should be
optional, externally supplied, and policy-aware.
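
The `${...}` references in the plan imply a small expression resolver. The following is one possible sketch, assuming references are dotted paths into a nested dictionary of earlier results. The names `REF`, `lookup`, and `resolve` are hypothetical, and the reference syntax accepted here is an assumption rather than a settled design.

```python
import re
from typing import Any

# Assumed reference syntax: ${dotted.path} with letters, digits, and underscores
REF = re.compile(r"\$\{([A-Za-z0-9_.]+)\}")


def lookup(path: str, context: dict[str, Any]) -> Any:
    """Walk a dotted path through nested dictionaries."""
    value: Any = context
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            raise KeyError(f"unresolved reference: ${{{path}}}")
        value = value[key]
    return value


def resolve(value: Any, context: dict[str, Any]) -> Any:
    """Replace references in strings, recursing into dicts and lists."""
    if isinstance(value, dict):
        return {k: resolve(v, context) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve(v, context) for v in value]
    if isinstance(value, str):
        whole = REF.fullmatch(value)
        if whole:
            # A string that is exactly one reference keeps the referenced type
            return lookup(whole.group(1), context)
        # Otherwise interpolate each reference as text
        return REF.sub(lambda m: str(lookup(m.group(1), context)), value)
    return value


context = {
    "sources": {
        "decisions": {
            "accepted": ["Adopt contracts.", "Use YAML plans."],
            "total": 2,
        }
    },
    "steps": {},
}
resolved = resolve({"decisions": "${sources.decisions.accepted}"}, context)
label = resolve("Count: ${sources.decisions.total}", context)
```

In this sketch a reference to a step that never ran raises `KeyError`. Whether a skipped optional step, such as `assisted_review` in the plan above, should instead resolve to an empty value is a design decision the pipeline layer would have to make.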

## Architecture Gap

The missing generalized layer needs:

- source collectors for Markdown files, globs, directories, and future indexes
- named extracted data products
- a small data expression model for referencing previous results
- deterministic step registry
- optional assisted step registry
- multi-output sinks
- provenance and diagnostics per step
- dry-run/plan/inspect modes
- caching and invalidation hooks
- policy hooks before assisted steps or sensitive output writes
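
Several of these pieces (step registries, per-step provenance, dry-run, and a policy hook in front of assisted steps) can be sketched together in a few lines. Everything below is a hypothetical illustration, not existing `markitect-tool` code: the registry names, the `StepRecord` fields, and the `allow_assisted` callback are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable

StepFn = Callable[[dict[str, Any]], str]

DETERMINISTIC: dict[str, StepFn] = {}
ASSISTED: dict[str, StepFn] = {}  # stays empty unless a caller supplies a provider


def deterministic(kind: str) -> Callable[[StepFn], StepFn]:
    """Decorator that registers a deterministic step under `kind`."""
    def register(fn: StepFn) -> StepFn:
        DETERMINISTIC[kind] = fn
        return fn
    return register


@deterministic("deterministic.join")
def join_lines(inputs: dict[str, Any]) -> str:
    return "\n".join(inputs["lines"])


@dataclass
class StepRecord:
    """Minimal provenance entry: what happened to each step."""
    name: str
    kind: str
    status: str  # "planned", "ran", or "skipped"
    detail: str = ""


def run_steps(steps, *, dry_run=False, allow_assisted=lambda name, step: False):
    results: dict[str, str] = {}
    records: list[StepRecord] = []
    for name, step in steps.items():
        kind = step["kind"]
        if dry_run:
            records.append(StepRecord(name, kind, "planned"))
            continue
        if kind in DETERMINISTIC:
            results[name] = DETERMINISTIC[kind](step.get("inputs", {}))
            records.append(StepRecord(name, kind, "ran"))
            continue
        fn = ASSISTED.get(kind)
        # Policy hook: assisted steps run only if a provider exists and policy allows
        if fn is not None and allow_assisted(name, step):
            results[name] = fn(step.get("inputs", {}))
            records.append(StepRecord(name, kind, "ran"))
        elif step.get("optional", False):
            records.append(
                StepRecord(name, kind, "skipped", "no provider or denied by policy")
            )
        else:
            raise RuntimeError(f"required step {name!r} of kind {kind!r} cannot run")
    return results, records


steps = {
    "summarize": {"kind": "deterministic.join", "inputs": {"lines": ["a", "b"]}},
    "assisted_review": {"kind": "assisted.generation", "optional": True},
}
results, records = run_steps(steps)
_, plan = run_steps(steps, dry_run=True)
```

With no assisted provider registered, the optional step is recorded as skipped and the deterministic result is still produced, which matches the goal of staying executable without LLM support.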

## Relationship to Existing Workplans

- `MKTT-WP-0003` gives the primitive surface.
- `MKTT-WP-0010` gives richer references, processors, regions, and chunks.
- `MKTT-WP-0006` gives backend/provenance/cache interfaces.
- `MKTT-WP-0005` gives runtime context and form/assessment engines.
- `MKTT-WP-0011` should become the declarative pipeline/workflow layer that
  wires these together.

## Recommendation

Do not squeeze this into P3.7. P3.7 should stay focused on lightweight caching
and incremental processing for the current primitives.

Create a new workplan for declarative Markdown dataflow pipelines. It should be
P1/P2: important enough not to forget, but best implemented after the reference
and processor model has at least its first architecture pass.