Wealth VSM Generation Pipeline

Date: 2026-05-14

Purpose

This document defines how infospace-bench regenerates the Adam Smith Wealth of Nations / VSM infospace through explicit workflows.

The successor path is workflow-first. It does not reuse the legacy process_chapters.py entrypoint, hide provider calls in a broad command, or write generated files outside the artifact manifest.

Legacy pipeline decomposition

The old Wealth/VSM experiment in markitect-main processed source chapters through these conceptual stages:

| Legacy stage | Successor workflow shape | Notes |
| --- | --- | --- |
| extract-entities | wealth-vsm-extract-entities assisted stage plus split_entities stage | Assisted output is a chapter entity bundle; bench splits and registers stable entity artifacts. |
| map-to-vsm | wealth-vsm-map-and-analyze assisted relation stage | Relation artifacts use the successor relation parser and manifest IDs. |
| synthesize-analysis | wealth-vsm-map-and-analyze assisted analysis stage | Analysis remains a generated artifact with source provenance. |
| evaluate-entity | wealth-vsm-evaluate-entities assisted stage | Evaluation files use successor artifact_id frontmatter. |
| assess-metrics | infospace-bench check | Deterministic checks merge generated evaluations into metrics and history. |
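
For readers who want the decomposition in machine-readable form, the table can be sketched as a small lookup. The workflow and stage names are taken from the table; the dictionary layout and the `successor_targets` helper are illustrative assumptions, not part of infospace-bench.

```python
# Hypothetical lookup built from the decomposition table.
# Keys are legacy stage names; values are (successor target, stage labels).
# The data layout is an assumption made for illustration only.
LEGACY_TO_SUCCESSOR = {
    "extract-entities": ("wealth-vsm-extract-entities", ["assisted", "split_entities"]),
    "map-to-vsm": ("wealth-vsm-map-and-analyze", ["assisted relation"]),
    "synthesize-analysis": ("wealth-vsm-map-and-analyze", ["assisted analysis"]),
    "evaluate-entity": ("wealth-vsm-evaluate-entities", ["assisted"]),
    "assess-metrics": ("infospace-bench check", []),
}


def successor_targets():
    """Return the distinct successor targets in legacy-stage order."""
    seen = []
    for target, _stages in LEGACY_TO_SUCCESSOR.values():
        if target not in seen:
            seen.append(target)
    return seen
```

Two legacy stages collapse into the single wealth-vsm-map-and-analyze workflow, so the five legacy stages map onto four successor targets.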

The first golden target is Book I Chapter III because it grounds the existing wealth-vsm-legacy-slice pilot and exercises the market-extent relation.

One-chapter pilot

infospaces/wealth-vsm-generation-pilot/ contains:

  • one source excerpt: book-1-chapter-03.md
  • explicit workflow declarations for extraction, VSM mapping/analysis, and entity evaluation
  • deterministic fixture responses for tests
  • markdown contracts for generated entity and relation artifacts
  • a pilot report comparing the successor workflow shape with the legacy process script

Default tests use fixture responses so they do not require network access, provider credentials, or live model output.
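
A minimal sketch of what a fixture-backed provider could look like, assuming requests are keyed by workflow, stage, and source file. `AssistedRequest` and `FixtureProvider` are hypothetical names for illustration; the real request record and adapter interface may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssistedRequest:
    """Hypothetical request record; field names are assumptions."""
    workflow: str
    stage: str
    source: str


class FixtureProvider:
    """Serves canned responses from memory and never touches the network."""

    def __init__(self, responses):
        # responses maps (workflow, stage, source) -> response text
        self._responses = dict(responses)

    def complete(self, request):
        key = (request.workflow, request.stage, request.source)
        try:
            return self._responses[key]
        except KeyError:
            # Fail loudly instead of falling back to a live provider.
            raise LookupError(f"no fixture response for {key}") from None
```

Raising on a missing fixture, rather than falling through to a live call, keeps the default test run deterministic and credential-free.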

Live provider-backed generation

Live provider-backed generation should use the same workflow declarations and the same assisted request records as the pilot. The caller must select a provider adapter explicitly, and the adapter should record provider metadata in workflow run records and artifact provenance.
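
One way to make explicit selection hard to bypass is to give the adapter parameter no default. The sketch below assumes an adapter is any callable from request text to response text; `run_assisted_stage` and the provenance keys are hypothetical, not the infospace-bench API.

```python
def run_assisted_stage(request_text, *, adapter, provider, model):
    """Run one assisted stage with a caller-chosen adapter.

    adapter, provider, and model are keyword-only with no defaults,
    so omitting any of them raises TypeError at the call site.
    """
    output = adapter(request_text)
    # Provider metadata travels with the output so it can be copied
    # into the run record and the artifact provenance.
    provenance = {"provider": provider, "model": model}
    return {"output": output, "provenance": provenance}
```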

Live runs should document:

  • provider and model
  • prompt/template version
  • source corpus selection
  • retry and rate-limit settings
  • expected cost range
  • resume strategy
  • generated artifact review status
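
The checklist above could be captured as a structured record so that an incomplete live run is easy to detect. This is a sketch under assumptions: `LiveRunRecord`, its field names, and `missing_fields` are illustrative and do not describe an existing schema.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class LiveRunRecord:
    """One field per item in the live-run checklist (names are assumptions)."""
    provider_and_model: Optional[str] = None
    prompt_template_version: Optional[str] = None
    source_corpus_selection: Optional[str] = None
    retry_and_rate_limit_settings: Optional[str] = None
    expected_cost_range: Optional[str] = None
    resume_strategy: Optional[str] = None
    artifact_review_status: Optional[str] = None


def missing_fields(record):
    """Return the names of checklist items that are still empty."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]
```

A run could then be refused, or flagged for review, whenever `missing_fields` returns a non-empty list.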

Full corpus scale-up

Scale-up should proceed only after the one-chapter pilot is green, meaning all of its checks pass.

Recommended sequence:

  1. Run Book I Chapter III with fixture responses.
  2. Run Book I Chapter III with a live provider in a disposable copy.
  3. Review generated entities, relations, evaluations, and metrics.
  4. Add a small Book I batch with explicit cost and resume notes.
  5. Only then run the full corpus.
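
The sequence above is strictly ordered, which can be sketched as a small gate that returns the next permitted step and rejects out-of-order progress. The step labels paraphrase the list; `SCALE_UP_STEPS` and `next_step` are hypothetical helpers, not existing tooling.

```python
# Step labels paraphrase the recommended sequence; they are not real commands.
SCALE_UP_STEPS = [
    "fixture run: Book I Chapter III",
    "live run in disposable copy: Book I Chapter III",
    "review entities, relations, evaluations, metrics",
    "small Book I batch with cost and resume notes",
    "full corpus",
]


def next_step(completed):
    """Return the first step not yet completed, or None when all are done.

    Raises ValueError if a later step is marked complete while an
    earlier one is not, since the sequence must not be skipped.
    """
    for index, step in enumerate(SCALE_UP_STEPS):
        if step not in completed:
            later_done = [s for s in SCALE_UP_STEPS[index + 1:] if s in completed]
            if later_done:
                raise ValueError(f"{later_done[0]!r} completed before {step!r}")
            return step
    return None
```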

The full corpus should not be committed wholesale until it has a current scoped use, deterministic acceptance coverage, and a migration report explaining what was generated, reviewed, deferred, or retired.