# Wealth VSM Generation Pipeline
Date: 2026-05-14
## Purpose
This document defines how `infospace-bench` regenerates the Adam Smith
`Wealth of Nations` / VSM infospace through explicit workflows.
The successor path is workflow-first. It does not reuse the legacy
`process_chapters.py` entrypoint, hide provider calls in a broad command, or
write generated files outside the artifact manifest.
## Legacy pipeline decomposition
The old Wealth/VSM experiment in `markitect-main` processed source chapters
through these conceptual stages:
| Legacy stage | Successor workflow shape | Notes |
| --- | --- | --- |
| `extract-entities` | `wealth-vsm-extract-entities` assisted stage plus `split_entities` stage | Assisted output is a chapter entity bundle; bench splits and registers stable entity artifacts. |
| `map-to-vsm` | `wealth-vsm-map-and-analyze` assisted relation stage | Relation artifacts use the successor relation parser and manifest IDs. |
| `synthesize-analysis` | `wealth-vsm-map-and-analyze` assisted analysis stage | Analysis remains a generated artifact with source provenance. |
| `evaluate-entity` | `wealth-vsm-evaluate-entities` assisted stage | Evaluation files use successor `artifact_id` frontmatter. |
| `assess-metrics` | `infospace-bench check` | Deterministic checks merge generated evaluations into metrics and history. |

The first golden target is Book I Chapter III because it grounds the existing
`wealth-vsm-legacy-slice` pilot and exercises the market-extent relation.
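
The stage mapping in the table above can be held as plain data when auditing a migration. The following is a minimal sketch, not part of `infospace-bench`: the stage and workflow names are copied from the table, while `LEGACY_TO_SUCCESSOR` and `successor_steps` are hypothetical names introduced only for illustration.

```python
# Hypothetical lookup mirroring the decomposition table.
# Keys and values are the names from the table; the dict and the helper
# function are illustrative and do not exist in infospace-bench.
LEGACY_TO_SUCCESSOR = {
    "extract-entities": ["wealth-vsm-extract-entities", "split_entities"],
    "map-to-vsm": ["wealth-vsm-map-and-analyze"],
    "synthesize-analysis": ["wealth-vsm-map-and-analyze"],
    "evaluate-entity": ["wealth-vsm-evaluate-entities"],
    "assess-metrics": ["infospace-bench check"],
}


def successor_steps(legacy_stage: str) -> list[str]:
    """Return the successor steps that replace one legacy stage."""
    try:
        return LEGACY_TO_SUCCESSOR[legacy_stage]
    except KeyError:
        raise ValueError(f"unknown legacy stage: {legacy_stage!r}") from None
```

Note that two legacy stages, `map-to-vsm` and `synthesize-analysis`, resolve to the same successor workflow, which matches the table.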
## One-chapter pilot
`infospaces/wealth-vsm-generation-pilot/` contains:
- one source excerpt: `book-1-chapter-03.md`
- explicit workflow declarations for extraction, VSM mapping/analysis, and
entity evaluation
- deterministic fixture responses for tests
- markdown contracts for generated entity and relation artifacts
- a pilot report comparing the successor workflow shape with the legacy
process script

Default tests use fixture responses so they do not require network access,
provider credentials, or live model output.
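
One way to picture fixture-backed tests is a stand-in provider that replays canned responses from disk. The sketch below is an assumption-heavy illustration: the `FixtureProvider` class, the one-JSON-file-per-request layout, the `response` key, and the request ID are all hypothetical and are not taken from the pilot's actual fixture format.

```python
import json
import tempfile
from pathlib import Path


class FixtureProvider:
    """Hypothetical stand-in that replays canned responses instead of calling a model."""

    def __init__(self, fixture_dir: Path) -> None:
        self.fixture_dir = fixture_dir

    def complete(self, request_id: str) -> str:
        # Assumed layout: one JSON file per request ID with a "response" key.
        path = self.fixture_dir / f"{request_id}.json"
        if not path.exists():
            # Fail loudly rather than falling back to a live call.
            raise KeyError(f"no fixture recorded for request {request_id!r}")
        return json.loads(path.read_text(encoding="utf-8"))["response"]


def demo() -> str:
    """Write one fixture to a temporary directory and replay it."""
    with tempfile.TemporaryDirectory() as tmp:
        fixture_dir = Path(tmp)
        fixture = {"response": "# Entities\n- division of labour"}
        (fixture_dir / "extract-book-1-chapter-03.json").write_text(
            json.dumps(fixture), encoding="utf-8"
        )
        return FixtureProvider(fixture_dir).complete("extract-book-1-chapter-03")
```

Because a missing fixture raises instead of reaching out to a provider, a test suite built this way cannot silently depend on network access or credentials.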
## Live provider-backed generation
Any live provider-backed generation should use the same workflow declarations and
the same assisted request records. Provider adapters must be selected
explicitly by the caller and should record provider metadata in workflow run
records and artifact provenance.
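
Explicit adapter selection and provenance recording might look like the following sketch. Everything here is hypothetical: `ProviderInfo`, `ADAPTERS`, `run_assisted_stage`, and the record layout are illustrative names, not the `infospace-bench` API. The point is only that the provider is a required argument with no default, and that its metadata travels with the output.

```python
from dataclasses import asdict, dataclass
from typing import Callable


@dataclass(frozen=True)
class ProviderInfo:
    name: str
    model: str


def fixture_adapter(prompt: str) -> str:
    # Placeholder adapter; a live adapter would call a provider here.
    return "fixture response"


# Hypothetical registry of adapters keyed by provider name.
ADAPTERS: dict[str, tuple[ProviderInfo, Callable[[str], str]]] = {
    "fixture": (ProviderInfo(name="fixture", model="none"), fixture_adapter),
}


def run_assisted_stage(workflow: str, prompt: str, *, provider: str) -> dict:
    """Run one assisted stage with a caller-chosen provider and record provenance."""
    if provider not in ADAPTERS:
        raise ValueError(f"unknown provider: {provider!r}")
    info, adapter = ADAPTERS[provider]
    return {
        "workflow": workflow,
        "output": adapter(prompt),
        # Provider metadata is stored alongside the generated output.
        "provenance": {"provider": asdict(info)},
    }
```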
Live runs should document:
- provider and model
- prompt/template version
- source corpus selection
- retry and rate-limit settings
- expected cost range
- resume strategy
- generated artifact review status
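
The checklist above could be captured as a small record so that an incomplete run note is easy to detect. This is a sketch under stated assumptions: `LiveRunNotes` and `missing_fields` are hypothetical names, the fields simply mirror the seven bullets, and free-text strings are an arbitrary choice of representation.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class LiveRunNotes:
    """Hypothetical record with one field per item a live run should document."""

    provider_and_model: str
    prompt_template_version: str
    source_corpus_selection: str
    retry_and_rate_limit_settings: str
    expected_cost_range: str
    resume_strategy: str
    artifact_review_status: str


def missing_fields(notes: LiveRunNotes) -> list[str]:
    """Return the names of fields left empty or whitespace-only, in declaration order."""
    return [f.name for f in fields(notes) if not getattr(notes, f.name).strip()]
```

A caller could refuse to start a live run while `missing_fields` returns a non-empty list.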
## Full corpus scale-up
Scale-up should proceed only after the one-chapter pilot is green, that is, after its fixture-backed tests pass.
Recommended sequence:
1. Run Book I Chapter III with fixture responses.
2. Run Book I Chapter III with a live provider in a disposable copy.
3. Review generated entities, relations, evaluations, and metrics.
4. Add a small Book I batch with explicit cost and resume notes.
5. Only then run the full corpus.

The full corpus should not be committed wholesale until it has a current scoped
use, deterministic acceptance coverage, and a migration report explaining what
was generated, reviewed, deferred, or retired.
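
The recommended sequence above is strictly ordered: a later step should not start until every earlier one has passed. A minimal sketch of that gate follows, assuming a hypothetical `next_allowed_step` helper and a simple list of pass/fail flags; the step descriptions paraphrase the numbered list.

```python
from typing import Optional

# Step descriptions paraphrase the recommended sequence; the list is illustrative.
SCALE_UP_STEPS = [
    "Book I Chapter III with fixture responses",
    "Book I Chapter III with a live provider in a disposable copy",
    "Review generated entities, relations, evaluations, and metrics",
    "Small Book I batch with explicit cost and resume notes",
    "Full corpus",
]


def next_allowed_step(passed: list[bool]) -> Optional[str]:
    """Return the first step that has not passed, or None when all are done.

    `passed[i]` says whether step i has passed; missing entries count as
    not passed, and results after the first failure are ignored.
    """
    padded = passed + [False] * len(SCALE_UP_STEPS)
    for step, ok in zip(SCALE_UP_STEPS, padded):
        if not ok:
            return step
    return None
```

With this shape, a passing result recorded for a later step cannot unlock the full corpus while an earlier step is still failing.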