Workplan for dataflow pipeline workflows

2026-05-04 01:22:45 +02:00
parent 1a1b5ab39c
commit 8260a66528
6 changed files with 337 additions and 1 deletions

@@ -0,0 +1,145 @@
# Markdown Dataflow Workflow Assessment
Date: 2026-05-04
## Question
Can `markitect-tool` support workflows that grab data from one or more Markdown
documents, process it deterministically or with optional LLM assistance, and
inject the result into one or more Markdown outputs?
## Short Answer
Partially today, but not yet as a clean framework.
The current implementation provides the right primitives:
- parse Markdown and frontmatter
- query/extract structured content
- transform documents
- compose files
- include/transclude content
- render deterministic templates
- generate stubs from contracts
- run simple generation plans
- expose a provider-neutral assisted-generation hook
However, the user still has to orchestrate these steps manually through shell
commands or Python code. There is no declarative pipeline model that says:
```text
sources -> extracted data products -> deterministic processors -> assisted
processors -> templates/generation -> multiple outputs
```
Adding that layer is what would make markitect practical for multi-document workflows.
## Comparison with markitect-main
`markitect-main` had more separate experiments:
- template parser and renderer
- data-driven draft generation
- prompt/LLM quality gates
- transclusion with variables and conditionals
- batch/document-processing commands
- infospace and spaces workflows
- cache/reference graph ideas
The new implementation is better in its core shape:
- smaller, provider-neutral modules
- deterministic behavior before optional LLM use
- CLI/API parity
- contracts as a stronger rule source than raw schemas
- structured query/extract feeding generation
- explicit safety boundaries for includes
- tests around each primitive
What the rewrite gave up or has not yet rebuilt:
- no first-class batch/pipeline runner yet
- no prompt/LLM workflow execution in core
- no variable/conditional transclusion yet
- no data-driven multi-record draft generator yet
- no workflow provenance graph tying inputs to outputs
- no multi-output orchestration
- no built-in object/data shaping between extraction and rendering
This is a good trade for the foundation, but the pipeline layer needs to exist.
## Desired Workflow Shape
A future pipeline plan should be Markdown-native and inspectable:
````markdown
# Release Note Pipeline
```yaml workflow
sources:
  decisions:
    glob: docs/adr/*.md
    extract:
      accepted:
        selector: sections[heading=Decision]
      status:
        selector: frontmatter.status
steps:
  summarize:
    kind: deterministic.template
    template: templates/release-summary.md
    data:
      decisions: ${sources.decisions.accepted}
  assisted_review:
    kind: assisted.generation
    input: ${steps.summarize.markdown}
    prompt: prompts/reviewer.md
    optional: true
outputs:
  release_notes:
    template: templates/release-notes.md
    data:
      summary: ${steps.summarize.markdown}
      review: ${steps.assisted_review.markdown}
    output: out/release-notes.md
```
````
This should remain executable without LLM support. Assisted steps should be
optional, externally supplied, and policy-aware.
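To illustrate what "Markdown-native and inspectable" could mean in practice, the sketch below pulls the machine-readable part out of a plan document using only the standard library. It is an assumption about how a loader might work, not existing markitect code: `find_workflow_block` is a hypothetical name, and parsing the returned YAML text is left to whichever YAML parser the project already uses.

```python
import re
from typing import Optional

# Hypothetical helper, not part of markitect today.
# Matches a fenced block whose info string is exactly "yaml workflow".
# The fence may use three or more backticks; the closing fence must use
# the same number as the opening fence.
_WORKFLOW_FENCE = re.compile(
    r"^(?P<fence>`{3,})yaml workflow[ \t]*\n(?P<body>.*?)^(?P=fence)[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def find_workflow_block(markdown_text: str) -> Optional[str]:
    """Return the raw YAML text of the first workflow block, or None."""
    match = _WORKFLOW_FENCE.search(markdown_text)
    return match.group("body") if match else None
```

Everything outside the fenced block stays ordinary prose, so a plan can document itself without affecting execution.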
## Architecture Gap
The missing generalized layer needs:
- source collectors for Markdown files, globs, directories, and future indexes
- named extracted data products
- a small data expression model for referencing previous results
- deterministic step registry
- optional assisted step registry
- multi-output sinks
- provenance and diagnostics per step
- dry-run/plan/inspect modes
- caching and invalidation hooks
- policy hooks before assisted steps or sensitive output writes
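Two items in this list, per-step provenance and caching/invalidation hooks, could plausibly share one mechanism: a content-derived key per step. The following is a minimal sketch under that assumption. `digest_text` and `step_cache_key` are hypothetical names, and the backend interfaces planned in `MKTT-WP-0006` may look quite different.

```python
import hashlib
import json


def digest_text(text: str) -> str:
    """SHA-256 hex digest of a source document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def step_cache_key(step_name: str, step_config: dict, inputs: dict[str, str]) -> str:
    """Derive a stable key from the step name, its config, and its input texts.

    The key changes when any input text or config value changes, so it can
    serve both as a cache key and as a provenance fingerprint.
    """
    payload = {
        "step": step_name,
        "config": step_config,
        "inputs": {name: digest_text(text) for name, text in inputs.items()},
    }
    # sort_keys makes the serialization independent of dict insertion order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```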
## Relationship to Existing Workplans
- `MKTT-WP-0003` gives the primitive surface.
- `MKTT-WP-0010` gives richer references, processors, regions, and chunks.
- `MKTT-WP-0006` gives backend/provenance/cache interfaces.
- `MKTT-WP-0005` gives runtime context and form/assessment engines.
- `MKTT-WP-0011` should become the declarative pipeline/workflow layer that
wires these together.
## Recommendation
Do not squeeze this into P3.7. P3.7 should stay focused on lightweight caching
and incremental processing for the current primitives.
Create a new workplan for declarative Markdown dataflow pipelines. It should be
P1/P2: important enough not to forget, but best implemented after the reference
and processor model has at least its first architecture pass.

@@ -35,6 +35,7 @@ and descriptions mirror the operational view.
| `MKTT-WP-0010` | P1 | todo | `MKTT-WP-0004`; task-level trigger: `MKTT-WP-0003-T006` | Trigger is satisfied; keep as the richer content-reference, processor, explode/implode, and weave/tangle track. |
| `MKTT-WP-0007` | P2 | todo | `MKTT-WP-0006` | First practical cache backend use case: AST/JSONPath/SQLite/FTS. |
| `MKTT-WP-0005` | P2 | todo | `MKTT-WP-0003`, `MKTT-WP-0004` | Pick up when generation/form/context or semantic assessment pressure appears. |
| `MKTT-WP-0011` | P2 | todo | `MKTT-WP-0003`; task-level triggers: `MKTT-WP-0010-T001`, `MKTT-WP-0010-T005` | Declarative Markdown dataflow workflows: source extraction, deterministic/assisted processing, and multi-output generation. |
| `MKTT-WP-0009` | P2 | todo | `MKTT-WP-0006` | Establish access-control gateway before security-sensitive cache/context use. |
| `MKTT-WP-0008` | P3 | todo | `MKTT-WP-0006`, `MKTT-WP-0007`, `MKTT-WP-0009` | Agent working-memory cache after backend and policy floor are available. |
@@ -53,6 +54,12 @@ context-memory, and access-control architecture before those become rigid.
These are mixed task/workstream dependencies. State Hub does not currently model
them natively.
`MKTT-WP-0011` captures the practical workflow layer that wires existing
primitives together: Markdown sources, selectors, deterministic processors,
optional assisted generation hooks, and multiple Markdown outputs. It should not
block P3.7, but it should follow the first reference model and processor
registry decisions in `MKTT-WP-0010`.
## State Hub Mirror
Native State Hub dependency edges should mirror the whole-workstream
@@ -69,6 +76,7 @@ dependencies:
- `MKTT-WP-0007 -> MKTT-WP-0006`
- `MKTT-WP-0005 -> MKTT-WP-0003`
- `MKTT-WP-0005 -> MKTT-WP-0004`
- `MKTT-WP-0011 -> MKTT-WP-0003`
- `MKTT-WP-0009 -> MKTT-WP-0006`
- `MKTT-WP-0008 -> MKTT-WP-0006`
- `MKTT-WP-0008 -> MKTT-WP-0007`

@@ -11,8 +11,10 @@ planning_order: 70
depends_on_workplans:
- MKTT-WP-0003
- MKTT-WP-0004
related_workplans:
- MKTT-WP-0011
created: "2026-05-03"
updated: "2026-05-03"
updated: "2026-05-04"
state_hub_workstream_id: "7918687e-2364-46b1-ab7e-65aa77cb8449"
---

@@ -14,6 +14,7 @@ depends_on_tasks:
- MKTT-WP-0003-T005
related_workplans:
- MKTT-WP-0010
- MKTT-WP-0011
created: "2026-05-03"
updated: "2026-05-04"
state_hub_workstream_id: "0c585f8a-5c7e-4c89-b785-5b0089180256"

@@ -17,6 +17,7 @@ informs_workplans:
- MKTT-WP-0007
- MKTT-WP-0008
- MKTT-WP-0009
- MKTT-WP-0011
created: "2026-05-04"
updated: "2026-05-04"
state_hub_workstream_id: "7863fd01-0be0-4dbc-9941-0151365bb9e1"

@@ -0,0 +1,179 @@
---
id: MKTT-WP-0011
type: workplan
title: "Markdown Dataflow Pipeline Workflows"
domain: markitect
status: todo
owner: markitect-tool
topic_slug: markitect
planning_priority: P2
planning_order: 75
depends_on_workplans:
- MKTT-WP-0003
depends_on_tasks:
- MKTT-WP-0010-T001
- MKTT-WP-0010-T005
related_workplans:
- MKTT-WP-0005
- MKTT-WP-0006
- MKTT-WP-0008
- MKTT-WP-0009
created: "2026-05-04"
updated: "2026-05-04"
state_hub_workstream_id: "ed4c491d-4f81-4df0-af51-5f4bd4d1ad91"
---
# MKTT-WP-0011: Markdown Dataflow Pipeline Workflows
## Purpose
Create a declarative workflow layer for Markdown-to-Markdown dataflow:
collecting data from one or more Markdown sources, applying deterministic and
optional assisted processing, and injecting the results into one or more
Markdown outputs.
## Background
The current toolkit has strong primitives: parse, query, extract, transform,
compose, include, template render, contract stub generation, generation plans,
and a provider-neutral assisted-generation hook.
What is missing is orchestration. Users can script the pieces manually, but
there is not yet a first-class workflow model for:
```text
Markdown sources -> extracted data products -> processors -> generated outputs
```
See `docs/markdown-dataflow-workflow-assessment.md`.
## P11.1 - Define workflow plan model
```task
id: MKTT-WP-0011-T001
status: todo
priority: high
state_hub_task_id: "c335cbaa-dfb9-4df5-b1ae-87aaf6097bd8"
```
Define a Markdown/YAML workflow plan format with sources, named data products,
steps, outputs, variables, dry-run behavior, diagnostics, and provenance.
Output: workflow schema, examples, and validation diagnostics.
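As a starting point for discussion, the plan model and its validation diagnostics might look roughly like the sketch below. All class and field names are hypothetical. The two step kinds are the ones used in the assessment's example plan, and a real schema would also need variables, dry-run behavior, and provenance.

```python
from dataclasses import dataclass, field

# Step kinds taken from the assessment's example plan; a real registry
# would be extensible rather than a fixed set.
KNOWN_STEP_KINDS = {"deterministic.template", "assisted.generation"}


@dataclass
class Step:
    name: str
    kind: str
    optional: bool = False


@dataclass
class Output:
    name: str
    template: str
    path: str


@dataclass
class WorkflowPlan:
    sources: dict[str, dict] = field(default_factory=dict)
    steps: list[Step] = field(default_factory=list)
    outputs: list[Output] = field(default_factory=list)


def validate(plan: WorkflowPlan) -> list[str]:
    """Return human-readable diagnostics; an empty list means the plan is valid."""
    diagnostics = []
    if not plan.sources:
        diagnostics.append("plan: no sources declared")
    seen = set()
    for step in plan.steps:
        if step.name in seen:
            diagnostics.append(f"steps.{step.name}: duplicate step name")
        seen.add(step.name)
        if step.kind not in KNOWN_STEP_KINDS:
            diagnostics.append(f"steps.{step.name}: unknown kind {step.kind!r}")
    for out in plan.outputs:
        if not out.path:
            diagnostics.append(f"outputs.{out.name}: missing output path")
    return diagnostics
```

Returning diagnostics as a list instead of raising on the first problem lets `inspect`-style tooling report every issue in one pass.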
## P11.2 - Implement Markdown source collectors
```task
id: MKTT-WP-0011-T002
status: todo
priority: high
state_hub_task_id: "16a89801-d96d-437f-a883-81d09586f47a"
```
Collect source data from files, globs, directories, frontmatter paths,
selectors, sections, blocks, metrics, and future reference/index backends.
Output: source collector API, selector integration, and tests.
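A collector for the simplest case, a glob plus a section selected by heading, could be sketched as follows. This is illustrative only: `extract_section` and `collect_sections` are hypothetical names, the heading match is a plain string comparison rather than the toolkit's selector syntax, and headings inside fenced code blocks are not treated specially.

```python
import re
from pathlib import Path
from typing import Optional

_HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*$")


def extract_section(markdown_text: str, heading: str) -> Optional[str]:
    """Return the body under `heading`, up to the next heading of the same
    or a higher level, or None when the heading is absent."""
    lines = markdown_text.splitlines()
    start = None
    level = 0
    for index, line in enumerate(lines):
        match = _HEADING.match(line)
        if not match:
            continue
        if start is None:
            if match.group(2) == heading:
                start, level = index + 1, len(match.group(1))
        elif len(match.group(1)) <= level:
            return "\n".join(lines[start:index]).strip()
    if start is None:
        return None
    return "\n".join(lines[start:]).strip()


def collect_sections(root: Path, pattern: str, heading: str) -> dict[str, str]:
    """Map each matching file (path relative to root) to its extracted section."""
    found = {}
    # Sorting keeps the result order stable across runs and platforms.
    for path in sorted(root.glob(pattern)):
        section = extract_section(path.read_text(encoding="utf-8"), heading)
        if section is not None:
            found[path.relative_to(root).as_posix()] = section
    return found
```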
## P11.3 - Implement deterministic step registry
```task
id: MKTT-WP-0011-T003
status: todo
priority: high
state_hub_task_id: "808bed93-c7e2-4b34-90f4-f6f961fef503"
```
Create step types for query/extract, transform, compose, include, template
render, contract stub generation, contract checks, and data shaping.
Output: deterministic workflow runner with dependency ordering.
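Dependency ordering can be derived from the `${steps.<name>...}` references a step's configuration contains. The sketch below assumes that convention and uses the standard library's `graphlib`; `referenced_steps` and `execution_order` are hypothetical names.

```python
import re
from graphlib import TopologicalSorter

# Captures the step name in references such as ${steps.summarize.markdown}.
_STEP_REF = re.compile(r"\$\{steps\.([A-Za-z_][A-Za-z0-9_]*)\.")


def referenced_steps(value) -> set[str]:
    """Collect step names referenced anywhere inside a nested config value."""
    if isinstance(value, str):
        return set(_STEP_REF.findall(value))
    if isinstance(value, dict):
        return set().union(*(referenced_steps(v) for v in value.values()))
    if isinstance(value, list):
        return set().union(*(referenced_steps(v) for v in value))
    return set()


def execution_order(steps: dict[str, dict]) -> list[str]:
    """Return step names so that every step runs after the steps it references.

    Raises ValueError for references to undeclared steps; graphlib raises
    CycleError (a ValueError subclass) when steps depend on each other.
    """
    graph = {}
    for name, config in steps.items():
        deps = referenced_steps(config)
        unknown = deps - steps.keys()
        if unknown:
            raise ValueError(f"steps.{name}: unknown step(s) {sorted(unknown)}")
        graph[name] = deps
    return list(TopologicalSorter(graph).static_order())
```

For the assessment's example plan, this places `summarize` before `assisted_review`, because the latter's `input` refers to `${steps.summarize.markdown}`.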
## P11.4 - Implement data expression and binding model
```task
id: MKTT-WP-0011-T004
status: todo
priority: high
state_hub_task_id: "ea1ad9d2-3668-4b65-afb4-f490e5bfd0c6"
```
Allow workflow steps and outputs to reference previous results by stable names,
for example `${sources.adrs.decisions}` or `${steps.summary.markdown}`.
Output: expression resolver, type checks, and missing-reference diagnostics.
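A resolver for dotted `${...}` references could be as small as the sketch below. It is a hypothetical illustration, not a committed design: a value that consists of exactly one expression keeps the referenced object's native type, while expressions embedded in longer strings are interpolated as text. Type checks are omitted.

```python
import re

_EXPR = re.compile(r"\$\{([A-Za-z0-9_.]+)\}")


class MissingReference(LookupError):
    """Raised when an expression points at a name that does not exist."""


def lookup(context: dict, dotted: str):
    """Walk a dotted path such as 'steps.summary.markdown' through nested dicts."""
    current = context
    walked = []
    for part in dotted.split("."):
        walked.append(part)
        if not isinstance(current, dict) or part not in current:
            raise MissingReference(
                f"${{{dotted}}}: nothing found at {'.'.join(walked)!r}"
            )
        current = current[part]
    return current


def resolve(value, context: dict):
    """Recursively replace ${...} expressions inside strings, lists, and dicts."""
    if isinstance(value, dict):
        return {key: resolve(item, context) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve(item, context) for item in value]
    if not isinstance(value, str):
        return value
    whole = _EXPR.fullmatch(value)
    if whole:
        # The entire value is one expression: keep lists/dicts as they are.
        return lookup(context, whole.group(1))
    return _EXPR.sub(lambda m: str(lookup(context, m.group(1))), value)
```

The diagnostic names the first path segment that could not be found, which is the kind of missing-reference message this task asks for.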
## P11.5 - Add optional assisted processing step boundary
```task
id: MKTT-WP-0011-T005
status: todo
priority: medium
state_hub_task_id: "ed1adc60-fdd8-4d4c-b4d7-7ce906e641c6"
```
Add assisted step support through the provider-neutral generation hook protocol.
The workflow engine must not require provider dependencies and must support
dry-run, optional steps, and policy gates before sending data to a provider.
Output: hook adapter interface and tests with fake providers.
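The boundary described here can be sketched with a structural `Protocol`, so the engine never imports a provider package. Everything below is an assumption for illustration: the existing provider-neutral hook may have a different signature, and `GenerationHook`, `FakeProvider`, and `run_assisted_step` are hypothetical names.

```python
from typing import Callable, Optional, Protocol


class GenerationHook(Protocol):
    """Assumed shape of an externally supplied assisted-generation hook."""

    def generate(self, prompt: str, input_markdown: str) -> str: ...


class FakeProvider:
    """Test double: deterministic, no network access."""

    def generate(self, prompt: str, input_markdown: str) -> str:
        return f"[reviewed] {input_markdown}"


def run_assisted_step(
    input_markdown: str,
    prompt: str,
    hook: Optional[GenerationHook],
    policy: Callable[[str], bool],
    optional: bool = True,
    dry_run: bool = False,
) -> dict:
    """Run one assisted step, checking dry-run, hook presence, then policy."""
    if dry_run:
        return {"status": "planned", "markdown": None}
    if hook is None:
        if optional:
            return {"status": "skipped", "markdown": None}
        raise RuntimeError("assisted step is required but no hook was supplied")
    # The policy gate runs before any data reaches the provider.
    if not policy(input_markdown):
        return {"status": "blocked", "markdown": None}
    return {"status": "ok", "markdown": hook.generate(prompt, input_markdown)}
```

With no hook supplied, an optional step reports `skipped` and the rest of the workflow can continue, which keeps the pipeline executable without LLM support.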
## P11.6 - Implement multi-output sinks
```task
id: MKTT-WP-0011-T006
status: todo
priority: high
state_hub_task_id: "902707d7-46fe-45d6-a9ec-b85763065ff9"
```
Support writing one or many Markdown outputs from templates, generated content,
or composed results. Outputs must be path-safe, reproducible, and traceable to
their source data.
Output: output sink API, path-safety checks, and provenance manifests.
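One common way to implement the path-safety check is to resolve the requested path and require that it stays inside a designated output root. The sketch below assumes such a root exists in the sink configuration; `safe_output_path` and `UnsafeOutputPath` are hypothetical names.

```python
from pathlib import Path


class UnsafeOutputPath(ValueError):
    """Raised when an output path would land outside the output root."""


def safe_output_path(output_root: Path, requested: str) -> Path:
    """Resolve `requested` against `output_root` and refuse anything that
    escapes it, including `..` segments and absolute paths."""
    root = output_root.resolve()
    # Joining with an absolute `requested` discards `root`, so the check
    # below also rejects absolute paths outside the root.
    target = (root / requested).resolve()
    if root not in target.parents:
        raise UnsafeOutputPath(f"{requested!r} is outside {root}")
    return target
```

Because `resolve()` also follows symbolic links that already exist, a link pointing outside the root is rejected as well.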
## P11.7 - Add workflow CLI
```task
id: MKTT-WP-0011-T007
status: todo
priority: high
state_hub_task_id: "ccc26867-5724-4205-b3fe-a8b9d046775d"
```
Add these subcommands:
```text
mkt workflow inspect <workflow.md>
mkt workflow plan <workflow.md>
mkt workflow run <workflow.md>
```
Include JSON/YAML outputs for agent use.
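The command shape above could be prototyped with `argparse` as shown below. This is only a sketch: it does not reflect how `mkt` is actually built, the `--format` flag is a hypothetical way to select JSON/YAML output, and `describe` merely echoes the parsed arguments instead of running anything.

```python
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for `mkt workflow {inspect,plan,run} <workflow.md>`."""
    parser = argparse.ArgumentParser(prog="mkt")
    commands = parser.add_subparsers(dest="command", required=True)
    workflow = commands.add_parser("workflow", help="declarative dataflow workflows")
    actions = workflow.add_subparsers(dest="action", required=True)
    for name in ("inspect", "plan", "run"):
        action = actions.add_parser(name)
        action.add_argument("workflow_file")
        # Hypothetical flag: machine-readable output for agent use.
        action.add_argument("--format", choices=("text", "json", "yaml"), default="text")
    return parser


def describe(argv: list[str]) -> str:
    """Parse argv and return the selected action as a JSON string."""
    args = build_parser().parse_args(argv)
    return json.dumps(
        {"action": args.action, "workflow": args.workflow_file, "format": args.format},
        sort_keys=True,
    )
```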
## P11.8 - Add representative end-to-end examples
```task
id: MKTT-WP-0011-T008
status: todo
priority: high
state_hub_task_id: "f8501ea6-1ead-477d-8f64-c196e7edfe68"
```
Create examples covering:
- multiple ADRs -> release notes
- contract data -> generated documents
- source snippets -> docs
- deterministic summary -> optional assisted review -> final Markdown
## Exit Criteria
- A non-programmer can write a Markdown/YAML workflow that extracts data from
Markdown documents and generates new Markdown outputs.
- The same workflow is repeatable for identical inputs.
- Assisted steps are optional and external.
- Diagnostics identify which source, step, or output failed.
- The implementation remains compatible with future references/processors,
cache/provenance, context engines, and access-control policy.