Workplan for dataflow pipeline workflows
docs/markdown-dataflow-workflow-assessment.md
# Markdown Dataflow Workflow Assessment

Date: 2026-05-04

## Question

Can `markitect-tool` support workflows that grab data from one or more Markdown
documents, process it deterministically or with optional LLM assistance, and
inject the result into one or more Markdown outputs?

## Short Answer

Partially today, but not yet as a clean framework.

The current implementation provides the right primitives:

- parse Markdown and frontmatter
- query/extract structured content
- transform documents
- compose files
- include/transclude content
- render deterministic templates
- generate stubs from contracts
- run simple generation plans
- expose a provider-neutral assisted-generation hook

However, the user still has to orchestrate these steps manually through shell
commands or Python code. There is no declarative pipeline model that says:

```text
sources -> extracted data products -> deterministic processors -> assisted
processors -> templates/generation -> multiple outputs
```

That missing layer is where markitect can become much more practical.

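To make the manual orchestration concrete, here is a minimal sketch of the glue a user writes by hand today: extract one section from a source document and inject it into an output template. It uses only the Python standard library and does not call markitect's actual API; `extract_section` and the sample ADR text are hypothetical and exist only for illustration.

```python
import re
from string import Template


def extract_section(markdown: str, heading: str) -> str:
    """Return the body under the first heading titled `heading`.

    Hypothetical helper, not part of markitect. It ignores fenced code
    blocks, so a '#' line inside a fence would be misread as a heading.
    """
    body, level = [], None
    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*?)\s*$", line)
        if level is None:
            # Still searching for the requested heading.
            if match and match.group(2) == heading:
                level = len(match.group(1))
            continue
        # Stop at the next heading of the same or a higher level.
        if match and len(match.group(1)) <= level:
            break
        body.append(line)
    return "\n".join(body).strip()


# Illustrative source document (made up for this sketch).
adr = "# ADR 1\n\n## Decision\n\nUse SQLite.\n\n## Consequences\n\nNone yet.\n"

decision = extract_section(adr, "Decision")
output = Template("## Accepted decisions\n\n- $decision\n").substitute(
    decision=decision
)
print(output)
```

Every additional source, processing step, or output means more of this hand-written wiring, which is the part a declarative pipeline model would replace.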
## Comparison with markitect-main

`markitect-main` had more separate experiments:

- template parser and renderer
- data-driven draft generation
- prompt/LLM quality gates
- transclusion with variables and conditionals
- batch/document-processing commands
- infospace and spaces workflows
- cache/reference graph ideas

The new implementation is better in its core shape:

- smaller, provider-neutral modules
- deterministic behavior before optional LLM use
- CLI/API parity
- contracts as a stronger rule source than raw schemas
- structured query/extract feeding generation
- explicit safety boundaries for includes
- tests around each primitive

What we sacrificed:

- no first-class batch/pipeline runner yet
- no prompt/LLM workflow execution in core
- no variable/conditional transclusion yet
- no data-driven multi-record draft generator yet
- no workflow provenance graph tying inputs to outputs
- no multi-output orchestration
- no built-in object/data shaping between extraction and rendering

This is a good trade for the foundation, but the pipeline layer needs to exist.

## Desired Workflow Shape

A future pipeline plan should be Markdown-native and inspectable:

````markdown
# Release Note Pipeline

```yaml workflow
sources:
  decisions:
    glob: docs/adr/*.md
    extract:
      accepted:
        selector: sections[heading=Decision]
      status:
        selector: frontmatter.status

steps:
  summarize:
    kind: deterministic.template
    template: templates/release-summary.md
    data:
      decisions: ${sources.decisions.accepted}

  assisted_review:
    kind: assisted.generation
    input: ${steps.summarize.markdown}
    prompt: prompts/reviewer.md
    optional: true

outputs:
  release_notes:
    template: templates/release-notes.md
    data:
      summary: ${steps.summarize.markdown}
      review: ${steps.assisted_review.markdown}
    output: out/release-notes.md
```
````

This should remain executable without LLM support. Assisted steps should be
optional, externally supplied, and policy-aware.

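The `${...}` references in the plan above imply a small lookup against previously computed results. The following is a sketch of how such references might be resolved, assuming dotted paths into a nested dictionary of results; the function names `lookup` and `resolve` and the context layout are assumptions made for illustration, not markitect's design.

```python
import re
from typing import Any

# Matches references such as ${steps.summarize.markdown}.
_REF = re.compile(r"\$\{([A-Za-z0-9_.]+)\}")


def lookup(path: str, context: dict[str, Any]) -> Any:
    """Walk a dotted path through nested dicts; fail loudly if it is missing."""
    value: Any = context
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            raise KeyError(f"unresolved reference: ${{{path}}}")
        value = value[key]
    return value


def resolve(value: Any, context: dict[str, Any]) -> Any:
    """Recursively replace ${...} references inside dicts, lists, and strings."""
    if isinstance(value, dict):
        return {k: resolve(v, context) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve(v, context) for v in value]
    if isinstance(value, str):
        whole = _REF.fullmatch(value)
        if whole:
            # A string that is only a reference keeps the native type.
            return lookup(whole.group(1), context)
        # References embedded in longer text are interpolated as strings.
        return _REF.sub(lambda m: str(lookup(m.group(1), context)), value)
    return value


# Hypothetical results gathered from earlier pipeline stages.
context = {
    "sources": {"decisions": {"accepted": ["Use SQLite.", "Adopt YAML plans."]}},
    "steps": {"summarize": {"markdown": "Two decisions accepted."}},
}

data = resolve(
    {
        "decisions": "${sources.decisions.accepted}",
        "title": "Summary: ${steps.summarize.markdown}",
    },
    context,
)
print(data)
```

In this sketch an unresolved reference raises an error. A real runner would also need a rule for references to skipped optional steps, such as `${steps.assisted_review.markdown}` when no assisted provider is supplied.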
## Architecture Gap

The missing generalized layer needs:

- source collectors for Markdown files, globs, directories, and future indexes
- named extracted data products
- a small data expression model for referencing previous results
- deterministic step registry
- optional assisted step registry
- multi-output sinks
- provenance and diagnostics per step
- dry-run/plan/inspect modes
- caching and invalidation hooks
- policy hooks before assisted steps or sensitive output writes

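A few of the items above (a deterministic step registry, an externally supplied assisted registry, per-step records, and a dry-run mode) can be sketched together. This is an illustration under assumptions, not a proposed API: `DETERMINISTIC`, `StepRecord`, `run`, and the `params` key are hypothetical names, and the "template" step is a stand-in that only renders a bullet list.

```python
from dataclasses import dataclass
from typing import Callable, Optional

StepFn = Callable[[dict], str]

# Registry of built-in deterministic step kinds (hypothetical).
DETERMINISTIC: dict[str, StepFn] = {}


def deterministic(kind: str) -> Callable[[StepFn], StepFn]:
    """Decorator that registers a function under a step kind."""
    def register(fn: StepFn) -> StepFn:
        DETERMINISTIC[kind] = fn
        return fn
    return register


@deterministic("deterministic.template")
def render_list(params: dict) -> str:
    # Stand-in for real template rendering: one bullet per item.
    return "\n".join(f"- {item}" for item in params["items"])


@dataclass
class StepRecord:
    """Minimal per-step diagnostic record."""
    name: str
    kind: str
    status: str  # "ok", "skipped", or "planned"
    output: Optional[str] = None


def run(
    steps: dict,
    assisted: Optional[dict] = None,
    dry_run: bool = False,
) -> list[StepRecord]:
    """Run steps in order; assisted handlers are supplied by the caller."""
    assisted = assisted or {}
    records = []
    for name, spec in steps.items():
        kind = spec["kind"]
        fn = DETERMINISTIC.get(kind) or assisted.get(kind)
        if fn is None:
            if spec.get("optional"):
                # Optional steps without a handler are skipped, not fatal.
                records.append(StepRecord(name, kind, "skipped"))
                continue
            raise LookupError(f"no handler registered for step kind {kind!r}")
        if dry_run:
            # Report what would run without executing it.
            records.append(StepRecord(name, kind, "planned"))
            continue
        records.append(StepRecord(name, kind, "ok", fn(spec.get("params", {}))))
    return records


steps = {
    "summarize": {"kind": "deterministic.template", "params": {"items": ["a", "b"]}},
    "assisted_review": {"kind": "assisted.generation", "optional": True},
}

records = run(steps)
for record in records:
    print(record.name, record.status)
```

Because no assisted handler is passed in, the optional `assisted_review` step is recorded as skipped and the pipeline still completes, which matches the requirement that plans stay executable without LLM support. Policy hooks, caching, and richer provenance are left out of this sketch.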
## Relationship to Existing Workplans

- `MKTT-WP-0003` gives the primitive surface.
- `MKTT-WP-0010` gives richer references, processors, regions, and chunks.
- `MKTT-WP-0006` gives backend/provenance/cache interfaces.
- `MKTT-WP-0005` gives runtime context and form/assessment engines.
- `MKTT-WP-0011` should become the declarative pipeline/workflow layer that
  wires these together.

## Recommendation

Do not squeeze this into P3.7. P3.7 should stay focused on lightweight caching
and incremental processing for the current primitives.

Create a new workplan for declarative Markdown dataflow pipelines. It should be
P1/P2: important enough not to forget, but best implemented after the reference
and processor model has at least its first architecture pass.