# Wealth VSM Generation Pipeline

Date: 2026-05-14

## Purpose

This document defines how `infospace-bench` regenerates the Adam Smith
`Wealth of Nations` / VSM infospace through explicit workflows.

The successor path is workflow-first. It does not reuse the legacy
`process_chapters.py` entrypoint, hide provider calls in a broad command, or
write generated files outside the artifact manifest.

## Legacy pipeline decomposition

The old Wealth/VSM experiment in `markitect-main` processed source chapters
through these conceptual stages:

| Legacy stage | Successor workflow shape | Notes |
| --- | --- | --- |
| `extract-entities` | `wealth-vsm-extract-entities` assisted stage plus `split_entities` stage | Assisted output is a chapter entity bundle; bench splits and registers stable entity artifacts. |
| `map-to-vsm` | `wealth-vsm-map-and-analyze` assisted relation stage | Relation artifacts use the successor relation parser and manifest IDs. |
| `synthesize-analysis` | `wealth-vsm-map-and-analyze` assisted analysis stage | Analysis remains a generated artifact with source provenance. |
| `evaluate-entity` | `wealth-vsm-evaluate-entities` assisted stage | Evaluation files use successor `artifact_id` frontmatter. |
| `assess-metrics` | `infospace-bench check` | Deterministic checks merge generated evaluations into metrics and history. |

The first golden target is Book I Chapter III because it grounds the existing
`wealth-vsm-legacy-slice` pilot and exercises the market-extent relation.

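To make the `split_entities` row in the table above more concrete, the following is a minimal sketch of splitting a chapter entity bundle into per-entity artifacts with `artifact_id` frontmatter. The bundle layout (one `## ` heading per entity), the ID scheme, and the names `EntityArtifact`, `slugify`, and the function signature are assumptions made for illustration; they are not taken from `infospace-bench`.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityArtifact:
    """Hypothetical in-memory form of one generated entity artifact."""

    artifact_id: str
    title: str
    body: str

    def to_markdown(self) -> str:
        # Frontmatter carries the artifact_id; the real contract may hold more keys.
        return (
            f"---\nartifact_id: {self.artifact_id}\n---\n\n"
            f"# {self.title}\n\n{self.body}\n"
        )


def slugify(title: str) -> str:
    """Lowercase the title and collapse non-alphanumeric runs into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def split_entities(bundle: str, chapter_id: str) -> list[EntityArtifact]:
    """Split a bundle into artifacts, assuming each entity starts at a '## ' heading."""
    sections = re.split(r"^## +", bundle, flags=re.MULTILINE)
    artifacts = []
    for section in sections[1:]:  # text before the first entity heading is ignored
        title, _, body = section.partition("\n")
        title = title.strip()
        # The ID depends only on chapter and title, so reruns yield the same ID.
        artifact_id = f"{chapter_id}--{slugify(title)}"
        artifacts.append(EntityArtifact(artifact_id, title, body.strip()))
    return artifacts
```

Deriving the ID from the chapter and entity title keeps it stable across regeneration, but two entities with the same title in one chapter would collide; a real implementation would need to detect or disambiguate that case.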
## One-chapter pilot

`infospaces/wealth-vsm-generation-pilot/` contains:

- one source excerpt: `book-1-chapter-03.md`
- explicit workflow declarations for extraction, VSM mapping/analysis, and
  entity evaluation
- deterministic fixture responses for tests
- markdown contracts for generated entity and relation artifacts
- a pilot report comparing the successor workflow shape with the legacy
  process script

Default tests use fixture responses so they do not require network access,
provider credentials, or live model output.

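One way fixture responses can stand in for a provider is sketched below: requests are hashed canonically and looked up in a recorded mapping, so tests never touch the network. The `Provider` protocol, `FixtureProvider`, `request_key`, and the request dictionary shape are hypothetical names chosen for this sketch, not the pilot's actual interfaces.

```python
import hashlib
import json
from typing import Protocol


class Provider(Protocol):
    """Hypothetical adapter interface shared by fixture and live providers."""

    def complete(self, request: dict) -> str: ...


def request_key(request: dict) -> str:
    """Hash a request canonically so key order does not change the lookup key."""
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class FixtureProvider:
    """Returns recorded responses and fails loudly on unrecorded requests."""

    def __init__(self, fixtures: dict[str, str]) -> None:
        self._fixtures = fixtures

    def complete(self, request: dict) -> str:
        key = request_key(request)
        if key not in self._fixtures:
            # A missing fixture is an error, never a silent fallback to a live call.
            raise KeyError(f"no fixture recorded for request {key[:12]}")
        return self._fixtures[key]
```

Raising on an unknown request, rather than falling back to a live call, keeps the default test run deterministic and credential-free.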
## Live provider-backed generation

Any live provider-backed generation should use the same workflow declarations and
the same assisted request records. Provider adapters must be selected
explicitly by the caller and should record provider metadata in workflow run
records and artifact provenance.

Live runs should document:

- provider and model
- prompt/template version
- source corpus selection
- retry and rate-limit settings
- expected cost range
- resume strategy
- generated artifact review status

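The checklist above could be captured as a small record type so that an incomplete live run is easy to spot. This is a sketch only: `LiveRunRecord`, its field names, and `undocumented_fields` are hypothetical and do not describe an existing `infospace-bench` schema.

```python
from dataclasses import dataclass, fields


@dataclass
class LiveRunRecord:
    """Hypothetical record with one free-text field per documented item."""

    provider: str = ""
    model: str = ""
    prompt_template_version: str = ""
    source_corpus_selection: str = ""
    retry_and_rate_limit_settings: str = ""
    expected_cost_range: str = ""
    resume_strategy: str = ""
    artifact_review_status: str = ""


def undocumented_fields(record: LiveRunRecord) -> list[str]:
    """Return the names of fields that are still empty or whitespace-only."""
    return [f.name for f in fields(record) if not getattr(record, f.name).strip()]
```

A caller could refuse to start a live run while `undocumented_fields(record)` is non-empty.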
## Full corpus scale-up

Scale-up should proceed only after the one-chapter pilot is green.

Recommended sequence:

1. Run Book I Chapter III with fixture responses.
2. Run Book I Chapter III with a live provider in a disposable copy.
3. Review generated entities, relations, evaluations, and metrics.
4. Add a small Book I batch with explicit cost and resume notes.
5. Only then run the full corpus.

The full corpus should not be committed wholesale until it has a current scoped
use, deterministic acceptance coverage, and a migration report explaining what
was generated, reviewed, deferred, or retired.

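The recommended sequence can be read as a strict gate: a stage may run only when every earlier stage has completed. The sketch below illustrates that rule; the stage identifiers and the functions `may_run` and `next_stage` are hypothetical labels for the five steps, not names used by the pilot.

```python
from typing import Optional

# Hypothetical identifiers for the five recommended steps, in order.
STAGES = [
    "chapter-fixture",
    "chapter-live-disposable",
    "review",
    "small-batch",
    "full-corpus",
]


def may_run(stage: str, completed: set[str]) -> bool:
    """A stage is allowed only if all stages before it are in `completed`."""
    index = STAGES.index(stage)  # raises ValueError for an unknown stage
    return all(prior in completed for prior in STAGES[:index])


def next_stage(completed: set[str]) -> Optional[str]:
    """Return the first stage not yet completed, or None when all are done."""
    for stage in STAGES:
        if stage not in completed:
            return stage
    return None
```

For example, with only the fixture run completed, `may_run("full-corpus", {"chapter-fixture"})` is false and `next_stage` points at the live run in a disposable copy.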