Workplan for dataflow pipeline workflows

2026-05-04 01:22:45 +02:00
parent 1a1b5ab39c
commit 8260a66528
6 changed files with 337 additions and 1 deletions
--- a/docs/markdown-dataflow-workflow-assessment.md
+++ b/docs/markdown-dataflow-workflow-assessment.md
@@ -0,0 +1,145 @@
+# Markdown Dataflow Workflow Assessment
+
+Date: 2026-05-04
+
+## Question
+
+Can `markitect-tool` support workflows that grab data from one or more Markdown
+documents, process it deterministically or with optional LLM assistance, and
+inject the result into one or more Markdown outputs?
+
+## Short Answer
+
+Partially today, but not yet as a clean framework.
+
+The current implementation provides the right primitives:
+
+- parse Markdown and frontmatter
+- query/extract structured content
+- transform documents
+- compose files
+- include/transclude content
+- render deterministic templates
+- generate stubs from contracts
+- run simple generation plans
+- expose a provider-neutral assisted-generation hook
+
+However, the user still has to orchestrate these steps manually through shell
+commands or Python code. There is no declarative pipeline model that says:
+
+```text
+sources -> extracted data products -> deterministic processors -> assisted
+processors -> templates/generation -> multiple outputs
+```
+
+That missing layer is where markitect can become much more practical.
+
+## Comparison with markitect-main
+
+`markitect-main` had more separate experiments:
+
+- template parser and renderer
+- data-driven draft generation
+- prompt/LLM quality gates
+- transclusion with variables and conditionals
+- batch/document-processing commands
+- infospace and spaces workflows
+- cache/reference graph ideas
+
+The new implementation is better in its core shape:
+
+- smaller, provider-neutral modules
+- deterministic behavior before optional LLM use
+- CLI/API parity
+- contracts as a stronger rule source than raw schemas
+- structured query/extract feeding generation
+- explicit safety boundaries for includes
+- tests around each primitive
+
+What it lacks compared with `markitect-main`:
+
+- no first-class batch/pipeline runner yet
+- no prompt/LLM workflow execution in core
+- no variable/conditional transclusion yet
+- no data-driven multi-record draft generator yet
+- no workflow provenance graph tying inputs to outputs
+- no multi-output orchestration
+- no built-in object/data shaping between extraction and rendering
+
+This is a good trade for the foundation, but the pipeline layer needs to exist.
+
+## Desired Workflow Shape
+
+A future pipeline plan should be Markdown-native and inspectable:
+
+````markdown
+# Release Note Pipeline
+
+```yaml workflow
+sources:
+  decisions:
+    glob: docs/adr/*.md
+    extract:
+      accepted:
+        selector: sections[heading=Decision]
+      status:
+        selector: frontmatter.status
+
+steps:
+  summarize:
+    kind: deterministic.template
+    template: templates/release-summary.md
+    data:
+      decisions: ${sources.decisions.accepted}
+
+  assisted_review:
+    kind: assisted.generation
+    input: ${steps.summarize.markdown}
+    prompt: prompts/reviewer.md
+    optional: true
+
+outputs:
+  release_notes:
+    template: templates/release-notes.md
+    data:
+      summary: ${steps.summarize.markdown}
+      review: ${steps.assisted_review.markdown}
+    output: out/release-notes.md
+```
+````
+
+This should remain executable without LLM support. Assisted steps should be
+optional, externally supplied, and policy-aware.
+
+## Architecture Gap
+
+The missing generalized layer needs:
+
+- source collectors for Markdown files, globs, directories, and future indexes
+- named extracted data products
+- a small data expression model for referencing previous results
+- deterministic step registry
+- optional assisted step registry
+- multi-output sinks
+- provenance and diagnostics per step
+- dry-run/plan/inspect modes
+- caching and invalidation hooks
+- policy hooks before assisted steps or sensitive output writes
+
+## Relationship to Existing Workplans
+
+- `MKTT-WP-0003` gives the primitive surface.
+- `MKTT-WP-0010` gives richer references, processors, regions, and chunks.
+- `MKTT-WP-0006` gives backend/provenance/cache interfaces.
+- `MKTT-WP-0005` gives runtime context and form/assessment engines.
+- `MKTT-WP-0011` should become the declarative pipeline/workflow layer that
+  wires these together.
+
+## Recommendation
+
+Do not squeeze this into P3.7. P3.7 should stay focused on lightweight caching
+and incremental processing for the current primitives.
+
+Create a new workplan for declarative Markdown dataflow pipelines. It should be
+P1/P2: important enough not to forget, but best implemented after the reference
+and processor model has at least its first architecture pass.
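Review note on the workflow shape above: the `${steps.<name>...}` references already imply an execution order, so a runner could derive it from the plan instead of asking authors to list steps in order. The following is a minimal sketch of that idea, not part of this commit. It assumes the plan has already been parsed into plain dictionaries; the names `REF`, `find_refs`, and `step_order` are hypothetical, and the reference syntax is taken only from the example plan.

```python
import re
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Matches references such as ${steps.summarize.markdown}.
# The syntax is assumed from the example plan; it is not a published grammar.
REF = re.compile(r"\$\{(sources|steps)\.([A-Za-z_][A-Za-z0-9_]*)[^}]*\}")


def find_refs(value):
    """Yield (namespace, name) for every ${...} reference nested in value."""
    if isinstance(value, str):
        for match in REF.finditer(value):
            yield match.group(1), match.group(2)
    elif isinstance(value, dict):
        for item in value.values():
            yield from find_refs(item)
    elif isinstance(value, list):
        for item in value:
            yield from find_refs(item)


def step_order(plan):
    """Return step names so that each step comes after the steps it references."""
    graph = {}
    for name, spec in plan.get("steps", {}).items():
        # Only references to other steps create ordering constraints.
        graph[name] = {ref for ns, ref in find_refs(spec) if ns == "steps"}
    # static_order() raises graphlib.CycleError if steps reference each other.
    return list(TopologicalSorter(graph).static_order())


# Hypothetical parsed plan, deliberately listed in the "wrong" order.
plan = {
    "steps": {
        "assisted_review": {
            "kind": "assisted.generation",
            "input": "${steps.summarize.markdown}",
        },
        "summarize": {
            "kind": "deterministic.template",
            "data": {"decisions": "${sources.decisions.accepted}"},
        },
    },
}

order = step_order(plan)
```

Here `order` places `summarize` before `assisted_review` even though the dictionary lists them the other way round. The sketch does not reject references to undefined steps; a real runner would presumably report those as diagnostics before ordering.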
--- a/docs/workplan-planning-map.md
+++ b/docs/workplan-planning-map.md
@@ -35,6 +35,7 @@ and descriptions mirror the operational view.
 | `MKTT-WP-0010` | P1 | todo | `MKTT-WP-0004`; task-level trigger: `MKTT-WP-0003-T006` | Trigger is satisfied; keep as the richer content-reference, processor, explode/implode, and weave/tangle track. |
 | `MKTT-WP-0007` | P2 | todo | `MKTT-WP-0006` | First practical cache backend use case: AST/JSONPath/SQLite/FTS. |
 | `MKTT-WP-0005` | P2 | todo | `MKTT-WP-0003`, `MKTT-WP-0004` | Pick up when generation/form/context or semantic assessment pressure appears. |
+| `MKTT-WP-0011` | P2 | todo | `MKTT-WP-0003`; task-level triggers: `MKTT-WP-0010-T001`, `MKTT-WP-0010-T005` | Declarative Markdown dataflow workflows: source extraction, deterministic/assisted processing, and multi-output generation. |
 | `MKTT-WP-0009` | P2 | todo | `MKTT-WP-0006` | Establish access-control gateway before security-sensitive cache/context use. |
 | `MKTT-WP-0008` | P3 | todo | `MKTT-WP-0006`, `MKTT-WP-0007`, `MKTT-WP-0009` | Agent working-memory cache after backend and policy floor are available. |

@@ -53,6 +54,12 @@ context-memory, and access-control architecture before those become rigid.
 These are mixed task/workstream dependencies. State Hub does not currently model
 them natively.

+`MKTT-WP-0011` captures the practical workflow layer that wires existing
+primitives together: Markdown sources, selectors, deterministic processors,
+optional assisted generation hooks, and multiple Markdown outputs. It should not
+block P3.7, but it should follow the first reference model and processor
+registry decisions in `MKTT-WP-0010`.
+
 ## State Hub Mirror

 Native State Hub dependency edges should mirror the whole-workstream
@@ -69,6 +76,7 @@ dependencies:
 - `MKTT-WP-0007 -> MKTT-WP-0006`
 - `MKTT-WP-0005 -> MKTT-WP-0003`
 - `MKTT-WP-0005 -> MKTT-WP-0004`
+- `MKTT-WP-0011 -> MKTT-WP-0003`
 - `MKTT-WP-0009 -> MKTT-WP-0006`
 - `MKTT-WP-0008 -> MKTT-WP-0006`
 - `MKTT-WP-0008 -> MKTT-WP-0007`
--- a/workplans/MKTT-WP-0005-runtime-context-and-assessment-engines.md
+++ b/workplans/MKTT-WP-0005-runtime-context-and-assessment-engines.md
@@ -11,8 +11,10 @@ planning_order: 70
 depends_on_workplans:
  - MKTT-WP-0003
  - MKTT-WP-0004
+related_workplans:
+  - MKTT-WP-0011
 created: "2026-05-03"
-updated: "2026-05-03"
+updated: "2026-05-04"
 state_hub_workstream_id: "7918687e-2364-46b1-ab7e-65aa77cb8449"
 ---

--- a/workplans/MKTT-WP-0006-cache-backend-architecture-core.md
+++ b/workplans/MKTT-WP-0006-cache-backend-architecture-core.md
@@ -14,6 +14,7 @@ depends_on_tasks:
  - MKTT-WP-0003-T005
 related_workplans:
  - MKTT-WP-0010
+  - MKTT-WP-0011
 created: "2026-05-03"
 updated: "2026-05-04"
 state_hub_workstream_id: "0c585f8a-5c7e-4c89-b785-5b0089180256"
--- a/workplans/MKTT-WP-0010-content-reference-processor-literate-workflows.md
+++ b/workplans/MKTT-WP-0010-content-reference-processor-literate-workflows.md
@@ -17,6 +17,7 @@ informs_workplans:
  - MKTT-WP-0007
  - MKTT-WP-0008
  - MKTT-WP-0009
+  - MKTT-WP-0011
 created: "2026-05-04"
 updated: "2026-05-04"
 state_hub_workstream_id: "7863fd01-0be0-4dbc-9941-0151365bb9e1"
--- a/workplans/MKTT-WP-0011-markdown-dataflow-pipeline-workflows.md
+++ b/workplans/MKTT-WP-0011-markdown-dataflow-pipeline-workflows.md
@@ -0,0 +1,179 @@
+---
+id: MKTT-WP-0011
+type: workplan
+title: "Markdown Dataflow Pipeline Workflows"
+domain: markitect
+status: todo
+owner: markitect-tool
+topic_slug: markitect
+planning_priority: P2
+planning_order: 75
+depends_on_workplans:
+  - MKTT-WP-0003
+depends_on_tasks:
+  - MKTT-WP-0010-T001
+  - MKTT-WP-0010-T005
+related_workplans:
+  - MKTT-WP-0005
+  - MKTT-WP-0006
+  - MKTT-WP-0008
+  - MKTT-WP-0009
+created: "2026-05-04"
+updated: "2026-05-04"
+state_hub_workstream_id: "ed4c491d-4f81-4df0-af51-5f4bd4d1ad91"
+---
+
+# MKTT-WP-0011: Markdown Dataflow Pipeline Workflows
+
+## Purpose
+
+Create a declarative workflow layer for Markdown-to-Markdown dataflow:
+collecting data from one or more Markdown sources, applying deterministic and
+optional assisted processing, and injecting the results into one or more
+Markdown outputs.
+
+## Background
+
+The current toolkit has strong primitives: parse, query, extract, transform,
+compose, include, template render, contract stub generation, generation plans,
+and a provider-neutral assisted-generation hook.
+
+What is missing is orchestration. Users can script the pieces manually, but
+there is not yet a first-class workflow model for:
+
+```text
+Markdown sources -> extracted data products -> processors -> generated outputs
+```
+
+See `docs/markdown-dataflow-workflow-assessment.md`.
+
+## P11.1 - Define workflow plan model
+
+```task
+id: MKTT-WP-0011-T001
+status: todo
+priority: high
+state_hub_task_id: "c335cbaa-dfb9-4df5-b1ae-87aaf6097bd8"
+```
+
+Define a Markdown/YAML workflow plan format with sources, named data products,
+steps, outputs, variables, dry-run behavior, diagnostics, and provenance.
+
+Output: workflow schema, examples, and validation diagnostics.
+
+## P11.2 - Implement Markdown source collectors
+
+```task
+id: MKTT-WP-0011-T002
+status: todo
+priority: high
+state_hub_task_id: "16a89801-d96d-437f-a883-81d09586f47a"
+```
+
+Collect source data from files, globs, directories, frontmatter paths,
+selectors, sections, blocks, metrics, and future reference/index backends.
+
+Output: source collector API, selector integration, and tests.
+
+## P11.3 - Implement deterministic step registry
+
+```task
+id: MKTT-WP-0011-T003
+status: todo
+priority: high
+state_hub_task_id: "808bed93-c7e2-4b34-90f4-f6f961fef503"
+```
+
+Create step types for query/extract, transform, compose, include, template
+render, contract stub generation, contract checks, and data shaping.
+
+Output: deterministic workflow runner with dependency ordering.
+
+## P11.4 - Implement data expression and binding model
+
+```task
+id: MKTT-WP-0011-T004
+status: todo
+priority: high
+state_hub_task_id: "ea1ad9d2-3668-4b65-afb4-f490e5bfd0c6"
+```
+
+Allow workflow steps and outputs to reference previous results by stable names,
+for example `${sources.decisions.accepted}` or `${steps.summarize.markdown}`.
+
+Output: expression resolver, type checks, and missing-reference diagnostics.
+
+## P11.5 - Add optional assisted processing step boundary
+
+```task
+id: MKTT-WP-0011-T005
+status: todo
+priority: medium
+state_hub_task_id: "ed1adc60-fdd8-4d4c-b4d7-7ce906e641c6"
+```
+
+Add assisted step support through the provider-neutral generation hook protocol.
+The workflow engine must not require provider dependencies and must support
+dry-run, optional steps, and policy gates before sending data to a provider.
+
+Output: hook adapter interface and tests with fake providers.
+
+## P11.6 - Implement multi-output sinks
+
+```task
+id: MKTT-WP-0011-T006
+status: todo
+priority: high
+state_hub_task_id: "902707d7-46fe-45d6-a9ec-b85763065ff9"
+```
+
+Support writing one or many Markdown outputs from templates, generated content,
+or composed results. Outputs must be path-safe, reproducible, and traceable to
+their source data.
+
+Output: output sink API, path-safety checks, and provenance manifests.
+
+## P11.7 - Add workflow CLI
+
+```task
+id: MKTT-WP-0011-T007
+status: todo
+priority: high
+state_hub_task_id: "ccc26867-5724-4205-b3fe-a8b9d046775d"
+```
+
+Add:
+
+```text
+mkt workflow inspect <workflow.md>
+mkt workflow plan <workflow.md>
+mkt workflow run <workflow.md>
+```
+
+Each command should offer JSON and YAML output for agent use.
+
+## P11.8 - Add representative end-to-end examples
+
+```task
+id: MKTT-WP-0011-T008
+status: todo
+priority: high
+state_hub_task_id: "f8501ea6-1ead-477d-8f64-c196e7edfe68"
+```
+
+Create examples covering:
+
+- multiple ADRs -> release notes
+- contract data -> generated documents
+- source snippets -> docs
+- deterministic summary -> optional assisted review -> final Markdown
+
+## Exit Criteria
+
+- A non-programmer can write a Markdown/YAML workflow that extracts data from
+  Markdown documents and generates new Markdown outputs.
+- The same workflow produces the same outputs when re-run on identical inputs.
+- Assisted steps are optional and external.
+- Diagnostics identify which source, step, or output failed.
+- The implementation remains compatible with future references/processors,
+  cache/provenance, context engines, and access-control policy.
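Review note on P11.4: the expression resolver and its missing-reference diagnostics could start out quite small. The following is a rough sketch under stated assumptions, not the planned implementation. It assumes results are held in nested dictionaries keyed by `sources` and `steps`; `REF`, `MissingReference`, `lookup`, and `resolve` are hypothetical names, and type checks are left out.

```python
import re

# A whole ${dotted.path} expression; the syntax is assumed from the workplan examples.
REF = re.compile(r"\$\{([A-Za-z0-9_.]+)\}")


class MissingReference(LookupError):
    """Raised when a ${...} expression names a result that does not exist."""


def lookup(context, dotted):
    """Walk a dotted path such as 'steps.summarize.markdown' through nested dicts."""
    current = context
    walked = []
    for part in dotted.split("."):
        walked.append(part)
        if not isinstance(current, dict) or part not in current:
            where = ".".join(walked)
            raise MissingReference(f"unresolved reference '{dotted}' at '{where}'")
        current = current[part]
    return current


def resolve(value, context):
    """Resolve ${...} references inside strings, dicts, and lists."""
    if isinstance(value, dict):
        return {key: resolve(item, context) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve(item, context) for item in value]
    if not isinstance(value, str):
        return value
    whole = REF.fullmatch(value)
    if whole:
        # A string that is exactly one reference keeps the referenced value's type.
        return lookup(context, whole.group(1))
    # Otherwise each reference is interpolated as text.
    return REF.sub(lambda m: str(lookup(context, m.group(1))), value)


# Hypothetical results collected so far by a workflow run.
context = {
    "sources": {"decisions": {"accepted": ["Use SQLite", "Adopt contracts"]}},
    "steps": {"summarize": {"markdown": "2 decisions accepted"}},
}

data = resolve(
    {
        "decisions": "${sources.decisions.accepted}",
        "title": "Summary: ${steps.summarize.markdown}",
    },
    context,
)

try:
    resolve("${steps.assisted_review.markdown}", context)
except MissingReference as error:
    diagnostic = str(error)
```

The diagnostic names both the full expression and the first path segment that failed, which is one way to meet the exit criterion that diagnostics identify the failing source, step, or output. How a reference to a skipped optional step (such as an assisted review) should resolve is left open here; this sketch simply treats it as missing.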