Workplan for dataflow pipeline workflows

2026-05-04 01:22:45 +02:00
parent 1a1b5ab39c
commit 8260a66528
6 changed files with 337 additions and 1 deletions
--- a/docs/markdown-dataflow-workflow-assessment.md
+++ b/docs/markdown-dataflow-workflow-assessment.md
@@ -0,0 +1,145 @@
+# Markdown Dataflow Workflow Assessment
+
+Date: 2026-05-04
+
+## Question
+
+Can `markitect-tool` support workflows that grab data from one or more Markdown
+documents, process it deterministically or with optional LLM assistance, and
+inject the result into one or more Markdown outputs?
+
+## Short Answer
+
+Partially today, but not yet as a clean framework.
+
+The current implementation provides the right primitives:
+
+- parse Markdown and frontmatter
+- query/extract structured content
+- transform documents
+- compose files
+- include/transclude content
+- render deterministic templates
+- generate stubs from contracts
+- run simple generation plans
+- expose a provider-neutral assisted-generation hook
+
+However, the user still has to orchestrate these steps manually through shell
+commands or Python code. There is no declarative pipeline model that says:
+
+```text
+sources -> extracted data products -> deterministic processors -> assisted
+processors -> templates/generation -> multiple outputs
+```
+
+That missing layer is what would make `markitect-tool` practical for these
+workflows: one declarative plan instead of hand-written glue.
+
+## Comparison with markitect-main
+
+`markitect-main` covered more ground, but as separate experiments:
+
+- template parser and renderer
+- data-driven draft generation
+- prompt/LLM quality gates
+- transclusion with variables and conditionals
+- batch/document-processing commands
+- infospace and spaces workflows
+- cache/reference graph ideas
+
+The new implementation is better in its core shape:
+
+- smaller, provider-neutral modules
+- deterministic behavior before optional LLM use
+- CLI/API parity
+- contracts as a stronger rule source than raw schemas
+- structured query/extract feeding generation
+- explicit safety boundaries for includes
+- tests around each primitive
+
+What we sacrificed:
+
+- no first-class batch/pipeline runner yet
+- no prompt/LLM workflow execution in core
+- no variable/conditional transclusion yet
+- no data-driven multi-record draft generator yet
+- no workflow provenance graph tying inputs to outputs
+- no multi-output orchestration
+- no built-in object/data shaping between extraction and rendering
+
+This is a good trade for the foundation, but the pipeline layer needs to exist.
+
+## Desired Workflow Shape
+
+A future pipeline plan should be Markdown-native and inspectable:
+
+````markdown
+# Release Note Pipeline
+
+```yaml workflow
+sources:
+  decisions:
+    glob: docs/adr/*.md
+    extract:
+      accepted:
+        selector: sections[heading=Decision]
+      status:
+        selector: frontmatter.status
+
+steps:
+  summarize:
+    kind: deterministic.template
+    template: templates/release-summary.md
+    data:
+      decisions: ${sources.decisions.accepted}
+
+  assisted_review:
+    kind: assisted.generation
+    input: ${steps.summarize.markdown}
+    prompt: prompts/reviewer.md
+    optional: true
+
+outputs:
+  release_notes:
+    template: templates/release-notes.md
+    data:
+      summary: ${steps.summarize.markdown}
+      review: ${steps.assisted_review.markdown}
+    output: out/release-notes.md
+```
+````
+
+This should remain executable without LLM support. Assisted steps should be
+optional, externally supplied, and policy-aware.
+
+## Architecture Gap
+
+The missing generalized layer needs:
+
+- source collectors for Markdown files, globs, directories, and future indexes
+- named extracted data products
+- a small data expression model for referencing previous results
+- deterministic step registry
+- optional assisted step registry
+- multi-output sinks
+- provenance and diagnostics per step
+- dry-run/plan/inspect modes
+- caching and invalidation hooks
+- policy hooks before assisted steps or sensitive output writes
+
+## Relationship to Existing Workplans
+
+- `MKTT-WP-0003` gives the primitive surface.
+- `MKTT-WP-0010` gives richer references, processors, regions, and chunks.
+- `MKTT-WP-0006` gives backend/provenance/cache interfaces.
+- `MKTT-WP-0005` gives runtime context and form/assessment engines.
+- `MKTT-WP-0011` should become the declarative pipeline/workflow layer that
+  wires these together.
+
+## Recommendation
+
+Do not squeeze this into P3.7. P3.7 should stay focused on lightweight caching
+and incremental processing for the current primitives.
+
+Create a new workplan (`MKTT-WP-0011`) for declarative Markdown dataflow
+pipelines. It should be P1/P2: important enough not to forget, but best
+implemented after the reference and processor model from `MKTT-WP-0010` has had
+at least its first architecture pass.
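
The "small data expression model" and the optional assisted step described in the assessment can be illustrated with a short sketch. This is not `markitect-tool` code: `lookup`, `resolve`, `run_steps`, the `render` callable standing in for a template file, and the `assist` callback standing in for an externally supplied provider are all hypothetical names chosen for illustration. It only assumes that `${...}` references are dotted paths into earlier results, as in the example plan.

```python
import re
from typing import Any, Callable, Optional

# Matches references such as ${steps.summarize.markdown} (assumed syntax).
REF = re.compile(r"\$\{([A-Za-z0-9_.]+)\}")


def lookup(context: dict, path: str) -> Any:
    """Walk a dotted path through nested dicts; KeyError means unresolved."""
    value: Any = context
    for key in path.split("."):
        value = value[key]
    return value


def resolve(value: Any, context: dict) -> Any:
    """Replace ${...} references in strings, recursing into dicts and lists."""
    if isinstance(value, dict):
        return {k: resolve(v, context) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve(v, context) for v in value]
    if isinstance(value, str):
        whole = REF.fullmatch(value)
        if whole:  # a bare reference keeps the referenced value's type
            return lookup(context, whole.group(1))
        return REF.sub(lambda m: str(lookup(context, m.group(1))), value)
    return value


def run_steps(
    steps: dict,
    context: dict,
    assist: Optional[Callable[[str], str]] = None,
) -> list:
    """Run steps in declaration order and return a (name, status) log."""
    log = []
    context.setdefault("steps", {})
    for name, spec in steps.items():
        if spec["kind"].startswith("assisted."):
            if assist is None:
                if not spec.get("optional", False):
                    raise RuntimeError(f"step {name!r} needs a provider")
                # Optional assisted step without a provider: empty result.
                context["steps"][name] = {"markdown": ""}
                log.append((name, "skipped"))
                continue
            markdown = assist(resolve(spec["input"], context))
        else:
            # Hypothetical: a callable replaces the template file here.
            markdown = spec["render"](resolve(spec["data"], context))
        context["steps"][name] = {"markdown": markdown}
        log.append((name, "ok"))
    return log


# Usage: extracted data products are assumed to be already collected.
context = {"sources": {"decisions": {"accepted": ["Use SQLite", "Drop XML"]}}}
steps = {
    "summarize": {
        "kind": "deterministic.template",
        "data": {"decisions": "${sources.decisions.accepted}"},
        "render": lambda d: "\n".join(f"- {item}" for item in d["decisions"]),
    },
    "assisted_review": {
        "kind": "assisted.generation",
        "input": "${steps.summarize.markdown}",
        "optional": True,
    },
}
log = run_steps(steps, context)  # no provider supplied
```

Run without a provider, the plan still completes and the log records the assisted step as skipped, which is the "executable without LLM support" property the assessment asks for. A real implementation would also need dependency ordering, diagnostics for unresolved references, and policy checks before calling `assist`.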