# Markdown Dataflow Workflow Assessment
Date: 2026-05-04
## Question
Can `markitect-tool` support workflows that grab data from one or more Markdown
documents, process it deterministically or with optional LLM assistance, and
inject the result into one or more Markdown outputs?
## Short Answer
Partially today, but not yet as a clean framework.
The current implementation provides the right primitives:
- parse Markdown and frontmatter
- query/extract structured content
- transform documents
- compose files
- include/transclude content
- render deterministic templates
- generate stubs from contracts
- run simple generation plans
- expose a provider-neutral assisted-generation hook
However, the user still has to orchestrate these steps manually through shell
commands or Python code. There is no declarative pipeline model that says:
```text
sources -> extracted data products -> deterministic processors -> assisted
processors -> templates/generation -> multiple outputs
```
That missing layer is where `markitect-tool` can become much more practical: one declared plan instead of hand-wired shell commands or Python code.
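To make the manual orchestration concrete, here is a minimal sketch of the stages in plain Python. It uses only the standard library; `extract_section` and `run_pipeline` are hypothetical names for illustration, not `markitect-tool` APIs, and the heading matcher is a simplification that ignores fenced code and nested subsections.

```python
import re
from string import Template


def extract_section(markdown: str, heading: str) -> str:
    """Return the text under the first heading titled `heading`.

    Simplification: the section ends at the next heading of any level.
    """
    body, capturing = [], False
    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*?)\s*$", line)
        if match:
            if capturing:
                break  # the next heading closes the section
            capturing = match.group(2) == heading
            continue
        if capturing:
            body.append(line)
    return "\n".join(body).strip()


def run_pipeline(sources: dict[str, str], template: str) -> str:
    """sources -> extracted data -> deterministic processing -> rendered output."""
    # 1. Extract one data product per source, in a stable (sorted) order.
    decisions = [extract_section(text, "Decision") for _, text in sorted(sources.items())]
    # 2. Deterministic processor: drop empty results and format as bullets.
    bullets = "\n".join(f"- {d}" for d in decisions if d)
    # 3. Render the output template.
    return Template(template).substitute(decisions=bullets)
```

Every stage here is hand-wired in code, which is exactly what a declarative plan would replace.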
## Comparison with markitect-main
`markitect-main` had more separate experiments:
- template parser and renderer
- data-driven draft generation
- prompt/LLM quality gates
- transclusion with variables and conditionals
- batch/document-processing commands
- infospace and spaces workflows
- cache/reference graph ideas
The new implementation is better in its core shape:
- smaller, provider-neutral modules
- deterministic behavior before optional LLM use
- CLI/API parity
- contracts as a stronger rule source than raw schemas
- structured query/extract feeding generation
- explicit safety boundaries for includes
- tests around each primitive
What we sacrificed:
- no first-class batch/pipeline runner yet
- no prompt/LLM workflow execution in core
- no variable/conditional transclusion yet
- no data-driven multi-record draft generator yet
- no workflow provenance graph tying inputs to outputs
- no multi-output orchestration
- no built-in object/data shaping between extraction and rendering
This is a good trade for the foundation, but the pipeline layer needs to exist.
## Desired Workflow Shape
A future pipeline plan should be Markdown-native and inspectable:
````markdown
# Release Note Pipeline
```yaml workflow
sources:
  decisions:
    glob: docs/adr/*.md
    extract:
      accepted:
        selector: sections[heading=Decision]
      status:
        selector: frontmatter.status
steps:
  summarize:
    kind: deterministic.template
    template: templates/release-summary.md
    data:
      decisions: ${sources.decisions.accepted}
  assisted_review:
    kind: assisted.generation
    input: ${steps.summarize.markdown}
    prompt: prompts/reviewer.md
    optional: true
outputs:
  release_notes:
    template: templates/release-notes.md
    data:
      summary: ${steps.summarize.markdown}
      review: ${steps.assisted_review.markdown}
    output: out/release-notes.md
```
````
This should remain executable without LLM support. Assisted steps should be
optional, externally supplied, and policy-aware.
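As a rough sketch of how `${...}` references and optional assisted steps might behave, the following resolves dotted references against earlier results and skips an assisted step when no provider is supplied. All names (`resolve`, `run_assisted`, the `skipped` flag, the provider callable) are assumptions for illustration; the actual expression model and provider interface are not defined yet.

```python
import re
from typing import Any, Callable, Optional

# Matches references such as ${steps.summarize.markdown}.
_REF = re.compile(r"\$\{([A-Za-z0-9_.]+)\}")


def resolve(expr: str, context: dict[str, Any]) -> Any:
    """Resolve '${a.b.c}' references by walking nested dicts in `context`."""

    def lookup(path: str) -> Any:
        value: Any = context
        for key in path.split("."):
            value = value[key]  # raises KeyError for an unknown reference
        return value

    whole = _REF.fullmatch(expr)
    if whole:
        # A lone reference keeps the referenced value's type (list, dict, str).
        return lookup(whole.group(1))
    # References embedded in longer text are interpolated as strings.
    return _REF.sub(lambda m: str(lookup(m.group(1))), expr)


def run_assisted(
    step: dict[str, Any],
    context: dict[str, Any],
    provider: Optional[Callable[[str], str]],
) -> dict[str, Any]:
    """Run an assisted step, or skip it when it is optional and no provider exists."""
    text = resolve(step["input"], context)
    if provider is None:
        if step.get("optional", False):
            # Downstream references still resolve, just to empty content.
            return {"markdown": "", "skipped": True}
        raise RuntimeError("assisted step requires an externally supplied provider")
    return {"markdown": provider(text), "skipped": False}
```

Returning an empty result for a skipped step keeps later references such as `${steps.assisted_review.markdown}` resolvable, so the plan stays executable without LLM support.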
## Architecture Gap
The missing generalized layer needs:
- source collectors for Markdown files, globs, directories, and future indexes
- named extracted data products
- a small data expression model for referencing previous results
- deterministic step registry
- optional assisted step registry
- multi-output sinks
- provenance and diagnostics per step
- dry-run/plan/inspect modes
- caching and invalidation hooks
- policy hooks before assisted steps or sensitive output writes
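A few of these pieces (step registry, dry-run mode, per-step provenance, and a policy hook) can be sketched together. This is a hypothetical shape, not an existing `markitect-tool` interface: `REGISTRY`, `register`, `Record`, `run`, the `allow` callback, and the `deterministic.join` step kind are all invented for the example.

```python
import hashlib
from dataclasses import dataclass
from typing import Any, Callable

StepFn = Callable[[dict[str, Any]], str]
REGISTRY: dict[str, StepFn] = {}  # step kind -> implementation


def register(kind: str) -> Callable[[StepFn], StepFn]:
    """Decorator that adds a step implementation to the registry."""
    def decorator(fn: StepFn) -> StepFn:
        REGISTRY[kind] = fn
        return fn
    return decorator


@register("deterministic.join")  # hypothetical step kind
def join_step(params: dict[str, Any]) -> str:
    return "\n".join(params["items"])


@dataclass
class Record:
    """Provenance/diagnostic entry for one step."""
    step: str
    kind: str
    status: str
    digest: str = ""


def run(
    plan: dict[str, dict[str, Any]],
    dry_run: bool = False,
    allow: Callable[[str], bool] = lambda kind: True,
) -> tuple[dict[str, str], list[Record]]:
    """Execute (or only plan) each step and record what happened."""
    results: dict[str, str] = {}
    records: list[Record] = []
    for name, spec in plan.items():
        kind = spec["kind"]
        if kind not in REGISTRY:
            records.append(Record(name, kind, "unknown-kind"))
        elif not allow(kind):  # policy hook runs before execution
            records.append(Record(name, kind, "blocked-by-policy"))
        elif dry_run:
            records.append(Record(name, kind, "planned"))
        else:
            results[name] = REGISTRY[kind](spec.get("params", {}))
            digest = hashlib.sha256(results[name].encode("utf-8")).hexdigest()[:12]
            records.append(Record(name, kind, "ok", digest))
    return results, records
```

Recording a content digest per step is one possible basis for the caching and invalidation hooks listed above, since an unchanged digest means downstream steps could be skipped.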
## Relationship to Existing Workplans
- `MKTT-WP-0003` gives the primitive surface.
- `MKTT-WP-0010` gives richer references, processors, regions, and chunks.
- `MKTT-WP-0006` gives backend/provenance/cache interfaces.
- `MKTT-WP-0005` gives runtime context and form/assessment engines.
- `MKTT-WP-0011` should become the declarative pipeline/workflow layer that
  wires these together.
## Recommendation
Do not squeeze this into P3.7. P3.7 should stay focused on lightweight caching
and incremental processing for the current primitives.
Create a new workplan for declarative Markdown dataflow pipelines. It should be
P1/P2: important enough not to forget, but best implemented after the reference
and processor model has at least its first architecture pass.