# Building an Infospace with History: Tutorial

This tutorial walks through how we built a structured **information space** (infospace) from Adam Smith's *The Wealth of Nations*, mapping classical economic concepts to Stafford Beer's **Viable System Model** (VSM), using MarkiTect's prompt dependency resolution and LLM integration.

By the end you will understand how to:

1. Design schemas that scaffold structured LLM output
2. Write prompt templates with dependency injection (`@{macro}` syntax)
3. Populate source artifacts and reference material
4. Run an incremental, chapter-by-chapter pipeline
5. Track every change through git history
6. Measure completeness and consistency with metrics
7. Continue the work to process remaining chapters

---

## 1. The Idea

We want to transform a large body of text, the full public-domain text of *The Wealth of Nations* (5 books, 35 chapters), into a **curated collection of economic concepts and entities**, each mapped to the VSM.

The challenge: this is too much for a single prompt. The text is hundreds of thousands of words. We need to work **incrementally**, one chapter at a time, building up the infospace and tracking progress.

MarkiTect's prompt dependency resolution lets us define **templates** with `@{placeholder}` macros that are filled from an artifact repository at execution time. The pipeline compiles each template into a complete prompt, sends it to an LLM, and stores the output, all tracked by git.

---

## 2. Project Layout

```
examples/infospace-with-history/
│
├── README.md                     # Project brief
├── TUTORIAL.md                   # This file
├── INFRA-TASKS.md                # Infrastructure issues found during the experiment
├── process_chapters.py           # Pipeline script
│
├── schemas/                      # Output structure definitions
│   ├── economic-entity-schema-v1.0.md
│   ├── vsm-concept-schema-v1.0.md
│   ├── vsm-mapping-schema-v1.0.md
│   └── chapter-analysis-schema-v1.0.md
│
├── templates/                    # Prompt templates (with @{macro} placeholders)
│   ├── extract-entities.md
│   ├── map-to-vsm.md
│   ├── synthesize-analysis.md
│   └── assess-metrics.md
│
├── artifacts/                    # Input artifacts
│   ├── sources/                  # Chapter text (35 files)
│   ├── guidelines/               # Extraction and mapping rules
│   └── vsm-reference/            # VSM framework definition
│
└── output/                       # Generated artifacts (LLM outputs)
    ├── entities/                 # Flat canonical entity set + chapter views
    │   ├── division-of-labour.md             # Canonical entity file (PRIMARY)
    │   ├── exchange.md
    │   ├── commercial-society.md
    │   ├── ...
    │   ├── book-1-chapter-01-entities.md     # Chapter view (transclusion)
    │   ├── book-1-chapter-01-prompt.md       # Compiled prompt
    │   ├── book-1-chapter-04-entities.md     # Also references division-of-labour.md
    │   └── ...
    ├── mappings/                 # Per-chapter VSM mappings
    ├── analyses/                 # Per-chapter synthesised analyses
    └── metrics/                  # Cross-chapter metrics reports
```

**Entity organisation**: The infospace maintains a **flat canonical set** of entities: one markdown file per entity, stored directly in `output/entities/`. When a chapter mentions an entity that already exists (detected by slug collision), the duplicate is skipped and the original definition is kept. This builds a **minimal necessary and sufficient set** of entities across the entire book.

Per-chapter `*-entities.md` files are **secondary views** that use MarkiTect's transclusion engine (`{{ include "entity.md" }}`) to compose entity content by reference. The same entity (e.g., `division-of-labour.md`) can appear in multiple chapter views.
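
The slug-collision rule and the chapter views can be sketched in a few lines of Python. This is a minimal illustration, not the code in `process_chapters.py`: the `slugify` rule, the function names `add_entities` and `write_chapter_view`, and the one-directive-per-entity view layout are all assumptions made for the example.

```python
import re
import tempfile
from pathlib import Path


def slugify(name: str) -> str:
    """Lower-case the name and join alphanumeric runs with hyphens (assumed rule)."""
    return "-".join(re.findall(r"[a-z0-9]+", name.lower()))


def add_entities(entities: dict[str, str], entity_dir: Path) -> list[str]:
    """Write each new entity once; return the slugs this chapter refers to."""
    slugs = []
    for name, body in entities.items():
        slug = slugify(name)
        path = entity_dir / f"{slug}.md"
        if not path.exists():  # slug collision: keep the first definition
            path.write_text(body, encoding="utf-8")
        slugs.append(slug)
    return slugs


def write_chapter_view(chapter: str, slugs: list[str], entity_dir: Path) -> Path:
    """Compose the chapter view from include directives, one per entity."""
    lines = [f'{{{{ include "{slug}.md" }}}}' for slug in slugs]
    view = entity_dir / f"{chapter}-entities.md"
    view.write_text("\n\n".join(lines) + "\n", encoding="utf-8")
    return view


# Hypothetical data: chapter 1 defines an entity, chapter 4 mentions it again.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    add_entities({"Division of Labour": "# Division of Labour\n"}, out)
    slugs = add_entities(
        {"Division of Labour": "# Duplicate\n", "Money": "# Money\n"}, out
    )
    view = write_chapter_view("book-1-chapter-04", slugs, out)
    kept = (out / "division-of-labour.md").read_text(encoding="utf-8")
    view_text = view.read_text(encoding="utf-8")
```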
Editing a canonical entity file automatically updates every chapter view that references it.

**Deduplication**: The pipeline tells the LLM which entities already exist (via the `@{existing_entities}` macro in the extraction template) so it focuses on genuinely new entities. At the file level, slug collisions are detected and skipped as a safety net.

---

## 3. Designing Schemas

Before writing any prompts we defined **four schemas** that tell the LLM exactly what sections each output document must contain. This ensures every generated document is machine-parseable and comparable across chapters.

### Economic Entity Schema (`schemas/economic-entity-schema-v1.0.md`)

Every extracted entity must have:

- **H1 heading** with the entity name
- **Definition** (20-150 words)
- **Source Chapter** citing Book and Chapter
- **Context**: where in Smith's argument the entity appears
- **Economic Domain** (Production, Distribution, Exchange, etc.)

Optional: Smith's Original Wording, Modern Interpretation.

### VSM Mapping Schema (`schemas/vsm-mapping-schema-v1.0.md`)

Every entity-to-VSM mapping must have:

- **H1 heading** in the format `Entity Name -> VSM Concept Name`
- **Economic Entity Reference** and **VSM Concept Reference**
- **Mapping Rationale** (minimum 30 words, grounded in Beer's definitions)
- **Mapping Strength**: Strong, Moderate, or Weak

### Chapter Analysis Schema (`schemas/chapter-analysis-schema-v1.0.md`)

The per-chapter synthesis includes:

- **Chapter Summary** (50-300 words)
- **Entities Extracted**: bulleted list
- **VSM Mappings**: entity, concept, strength
- **VSM Coverage**: explicit assessment of S1 through S5 and S3*
- **Gaps & Observations**

### Metrics Schema (implicit in `assess-metrics` template)

The metrics report computes:

- VSM Concept Coverage (% of S1-S5, recursion, variety, etc.)
- Chapter Coverage (% of 35 chapters processed)
- Entity and Mapping counts
- Terminology Consistency and Cross-reference Integrity scores

**Key insight**: Schemas are not code; they are markdown documents that the LLM reads as instructions. This means you can iterate on them without changing any infrastructure.

---

## 4. Writing Prompt Templates

Each template is a markdown file containing instructions for the LLM plus `@{macro_name}` placeholders that MarkiTect's resolver fills with artifact content at compile time.

### Template 1: Extract Entities (`templates/extract-entities.md`)

```markdown
# Extract Economic Entities

You are an analytical economist specialising in classical economic theory. Your task is to extract distinct economic entities from a chapter of Adam Smith's *The Wealth of Nations*.

## Source Chapter
@{chapter_text}

## Extraction Guidelines
@{extraction_rules}

## VSM Framework Context
@{vsm_framework}

## Existing Entities
@{existing_entities}

## Instructions
[... detailed step-by-step instructions ...]

## Output Format
Output each entity as a separate markdown document, delimited by `--- ENTITY: ---` markers.
```

The four macros (`chapter_text`, `extraction_rules`, `vsm_framework`, `existing_entities`) are resolved by looking up artifacts by name in the relevant information spaces. The `existing_entities` list is dynamically generated at runtime from the canonical entity files already on disk, enabling incremental extraction without duplication.

### Template 2: Map to VSM (`templates/map-to-vsm.md`)

Takes `@{entities}` (output from stage 1), `@{vsm_framework}`, and `@{mapping_rules}` as inputs.

### Template 3: Synthesise Analysis (`templates/synthesize-analysis.md`)

Takes `@{chapter_text}`, `@{entities}` (stage 1 output), `@{mappings}` (stage 2 output), and `@{vsm_framework}`.

### Template 4: Assess Metrics (`templates/assess-metrics.md`)

Takes `@{all_analyses}` (concatenation of all chapter analyses) and `@{vsm_framework}`.
This template runs across the entire infospace, not per chapter.

**Dependency chain per chapter:**

```
chapter_text ──────┐
extraction_rules ──┤
vsm_framework ─────┤
                   ▼
            extract-entities
                   │
                   ▼ entities
               map-to-vsm
                   │
                   ▼ mappings
          synthesize-analysis
                   │
                   ▼ analysis
```

After all chapters are processed, `assess-metrics` evaluates the complete infospace.

---

## 5. Populating Artifacts

### Source chapters (`artifacts/sources/`)

35 markdown files containing the full public-domain text of each chapter. Named by convention: `book-1-chapter-01.md` through `book-5-chapter-03.md`, plus `introduction.md`.

These are loaded into the `infospace-sources` information space.

### Guidelines (`artifacts/guidelines/`)

Two hand-written reference documents:

- **`extraction-rules.md`**: What constitutes an entity, granularity rules, naming conventions, quality checks.
- **`mapping-rules.md`**: How to map entities to VSM systems, what constitutes strong/moderate/weak mapping strength.

These are loaded into `infospace-guidelines`.

### VSM reference (`artifacts/vsm-reference/`)

- **`vsm-framework.md`**: Complete description of Beer's VSM (S1-S5, S3*, recursion, variety, viability, attenuation/amplification, algedonic signals, autonomy). Includes economic interpretations for each system.

Loaded into `infospace-vsm-reference`.

---

## 6. The Pipeline Script

`process_chapters.py` orchestrates everything. It:

1. Initialises the artifact repository (SQLite) and information spaces
2. Loads all static artifacts (templates, guidelines, VSM reference)
3. For each chapter, runs the three-stage pipeline
4. Optionally calls an LLM to auto-generate outputs
5. Records dependency edges in the graph
6. Commits results to git

### Running a single chapter

```bash
# Manual mode (writes prompts, waits for you to provide output files):
python process_chapters.py --chapter book-1-chapter-05 --no-commit

# Automatic mode via OpenRouter (recommended: fast, real token counts):
python process_chapters.py --chapter book-1-chapter-05 --provider openrouter --no-commit

# Automatic mode via Claude Code CLI:
python process_chapters.py --chapter book-1-chapter-05 --provider claude-code --no-commit

# With a specific model:
python process_chapters.py --chapter book-1-chapter-05 --provider openrouter --model anthropic/claude-haiku-4-5-20251001 --no-commit
```

### Running a whole book

```bash
python process_chapters.py --book 1 --provider openrouter --no-commit
```

### Running all chapters

```bash
python process_chapters.py --all --provider openrouter --no-commit
```

### Checking progress

```bash
python process_chapters.py --list
```

Prints a table showing which chapters have completed each stage (entity counts reflect the chapter view's transclusion references, including shared entities from earlier chapters):

```
Available chapters (35):

  Chapter                        Entities     Mappings     Analysis
  ------------------------------ ------------ ------------ ------------
  book-1-chapter-01              done (13)    done         done
  book-1-chapter-02              done (7)     done         done
  book-1-chapter-03              done (18)    done         done
  book-1-chapter-04              done (5)     done         done
  book-1-chapter-05              -            -            -
  ...

Canonical entity set: 41 unique entities
```

### Assessing metrics

After processing a batch of chapters, run the metrics assessment:

```bash
python process_chapters.py --metrics --provider openrouter --no-commit
```

This concatenates all completed analyses and asks the LLM to evaluate coverage, consistency, and completeness.

### Dependency statistics

```bash
python process_chapters.py --stats
```

---

## 7. How the LLM Integration Works

The pipeline uses MarkiTect's `markitect.llm` module, which provides two adapter backends that implement the `LLMAdapter` interface:

| Backend | How it works | Pros | Cons |
|---------|-------------|------|------|
| `openrouter` | HTTP POST to OpenRouter API | Fast, real token counts, any model | Needs API key |
| `claude-code` | Shells out to `claude --print` | No API key needed if CLI installed | Slower, estimated token counts |

### API key setup (OpenRouter)

Place your key in one of these locations (checked in order):

1. Pass `--api-key` on the command line (not yet implemented in the CLI)
2. Set `OPENROUTER_API_KEY` environment variable
3. Create `apikey-openrouter.txt` in the project root (git-ignored)

### What happens per stage

1. The pipeline **resolves** macro placeholders by looking up artifacts in the repository
2. It **compiles** the template into a complete prompt (macros replaced with real content)
3. It writes the compiled prompt to `output//-prompt.md` for inspection
4. If no output exists yet and an LLM adapter is configured, it **executes** the prompt
5. **For entity extraction (stage 1):** the pipeline first binds the list of already-existing entity slugs to `@{existing_entities}` so the LLM knows what to skip. The LLM returns combined content with `--- ENTITY: ---` delimiters. The pipeline splits this into the **flat canonical directory** (`output/entities/.md`), skipping any entity whose slug already exists. It then generates the chapter view file with transclusion directives. The combined content is never persisted as a single file; canonical entity files are the source of truth.
6. **For other stages:** the result is written directly to its output file
7. The output is **stored** as a generated artifact in the repository
8. Dependency edges are **recorded** in the graph

---

## 8. Tracking History with Git

Every processed chapter produces a git commit containing:

- Compiled prompts (`*-prompt.md`), so you can audit exactly what was sent
- Canonical entity files (`output/entities/.md`): one file per entity, shared across chapters, first occurrence wins
- Chapter entity views (`-entities.md`): transclusions of the canonical entities relevant to each chapter
- Generated outputs (`*-mappings.md`, `*-analysis.md`)

This means:

- `git log` shows the chronological order of processing
- `git diff` between commits shows what each chapter contributed
- You can `git bisect` to find where quality degraded
- You can revert a chapter and re-process it with different settings

To let the script auto-commit (default):

```bash
python process_chapters.py --chapter book-1-chapter-05 --provider openrouter
```

To commit manually after reviewing:

```bash
python process_chapters.py --chapter book-1-chapter-05 --provider openrouter --no-commit
# review new entity files in output/entities/ (look for recently modified .md files)
# review chapter view in output/entities/book-1-chapter-05-entities.md
git add examples/infospace-with-history/output/
git commit -m "infospace: process book-1-chapter-05"
```

---

## 9. Cost and Performance

From our measurements processing chapters 3 and 4:

| Metric | Claude Code CLI | OpenRouter |
|---|---|---|
| Time per chapter | ~5 minutes | ~2 minutes |
| Token counts | Estimated (4 chars/tok) | Real (from API) |
| Cost per chapter | ~$0.35 est. | ~$0.07 est. |

**Projected cost for all 35 chapters via OpenRouter:** ~$2.50, that is, 35 chapters at roughly $0.07 each (varies by chapter length; Book V chapters are longer).

To reduce costs further, use a cheaper model:

```bash
python process_chapters.py --all --provider openrouter --model anthropic/claude-haiku-4-5-20251001 --no-commit
```

---

## 10. Completing the Remaining Chapters

As of now, 4 of 35 chapters are processed (Book I, Chapters 1-4). Here is how to complete the rest.

### Step-by-step

**1. Process remaining Book I chapters (5-11):**

```bash
python process_chapters.py --book 1 --provider openrouter --no-commit
```

Already-processed chapters are skipped (their chapter view files exist). Entities from earlier chapters are automatically shared: the LLM is told which entities already exist and avoids re-extracting them.

**2. Process Books II-V:**

```bash
python process_chapters.py --book 2 --provider openrouter --no-commit
python process_chapters.py --book 3 --provider openrouter --no-commit
python process_chapters.py --book 4 --provider openrouter --no-commit
python process_chapters.py --book 5 --provider openrouter --no-commit
```

Or all at once:

```bash
python process_chapters.py --all --provider openrouter --no-commit
```

**3. Run metrics after each book (or at the end):**

```bash
python process_chapters.py --metrics --provider openrouter --no-commit
```

**4. Commit the results:**

```bash
git add examples/infospace-with-history/output/
git commit -m "infospace: process all remaining chapters"
```

**5. Review the metrics report:**

Open `output/metrics/metrics-report.md`. It will show:

- Which VSM concepts (S1-S5, recursion, variety, etc.) now have mappings
- Total entity and mapping counts
- Consistency scores
- Recommendations for gaps

### Expected progression

| After | Chapters | Expected coverage |
|-------|----------|-------------------|
| Book I (11 ch.) | 11/35 | S1, S2, S4 strong; S3 emerging |
| Books I-II (16 ch.) | 16/35 | S3 (capital control) covered |
| Books I-III (20 ch.) | 20/35 | Historical patterns add depth |
| Books I-IV (30 ch.) | 30/35 | S5 (policy, mercantilism) emerging |
| All (35 ch.) | 35/35 | Full coverage, S3* and algedonic signals likely from Book V |

Book V (public revenue, taxation, sovereign duties) is expected to fill the remaining gaps in S3*, S5, and regulatory concepts.

---

## 11. Quality Improvement Loop

The infospace is designed to be **iteratively refined**:

1. **Process chapters**: run the pipeline
2. **Assess metrics**: identify gaps in VSM coverage and consistency
3. **Refine guidelines**: update `extraction-rules.md` or `mapping-rules.md` to address identified weaknesses
4. **Re-process**: delete output files for specific chapters and re-run with improved guidelines
5. **Compare**: use git diff to see how the refined guidelines changed the output

Example: if metrics show that S3* (Audit) is consistently missed, you could add a paragraph to `extraction-rules.md` explicitly asking the LLM to look for audit, inspection, and oversight mechanisms.

To re-process a specific chapter, remove its chapter view and downstream outputs. Note: canonical entity files in `output/entities/` are shared across chapters, so only delete individual entity files if you want them re-extracted from scratch.

```bash
rm -f examples/infospace-with-history/output/entities/book-1-chapter-03-entities.md
rm -f examples/infospace-with-history/output/mappings/book-1-chapter-03-mappings.md
rm -f examples/infospace-with-history/output/analyses/book-1-chapter-03-analysis.md
python process_chapters.py --chapter book-1-chapter-03 --provider openrouter --no-commit
```

To also re-extract specific entities, delete their canonical files first:

```bash
rm -f examples/infospace-with-history/output/entities/extent-of-the-market.md
# then re-process the chapter as above
```

---

## 12. Infrastructure Issues Found and Fixed

During development we documented three issues with the MarkiTect infrastructure in `INFRA-TASKS.md`:

1. **Artifact repo doesn't store content**: the resolver returned placeholder text instead of actual artifact content.
2. **ContentMacro `raw_text` defaults to `""`**: this caused silent data corruption when macros were constructed programmatically.
3. **No `@{target}` syntax in MacroParser**: macros had to be constructed manually rather than auto-detected from template text.

All three have been fixed in the MarkiTect infrastructure.
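
To make the second and third issues concrete, auto-detecting macros amounts to a pattern match over `@{name}` occurrences, and refusing to compile when a macro has no content avoids the silent-empty-string failure. The sketch below is an illustration only, not MarkiTect's `MacroParser` or resolver: the regular expression, the function names `find_macros` and `compile_prompt`, and the plain `dict` standing in for the artifact repository are all assumptions.

```python
import re

# Assumed macro grammar: "@{" + identifier + "}".
MACRO = re.compile(r"@\{([A-Za-z_][A-Za-z0-9_]*)\}")


def find_macros(template: str) -> list[str]:
    """Return macro names in order of first appearance, without duplicates."""
    return list(dict.fromkeys(MACRO.findall(template)))


def compile_prompt(template: str, artifacts: dict[str, str]) -> str:
    """Replace every macro with its artifact text; fail loudly if one is missing."""
    missing = [name for name in find_macros(template) if name not in artifacts]
    if missing:
        raise KeyError(f"unresolved macros: {missing}")
    return MACRO.sub(lambda match: artifacts[match.group(1)], template)


# Hypothetical two-macro template and artifact contents.
template = (
    "## Source Chapter\n@{chapter_text}\n\n"
    "## VSM Framework Context\n@{vsm_framework}\n"
)
names = find_macros(template)
prompt = compile_prompt(
    template,
    {"chapter_text": "Chapter text here.", "vsm_framework": "S1 to S5."},
)
```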
The pipeline script (`process_chapters.py`) has been refactored to use the fixed infrastructure directly: the local content cache, manual macro construction, and manual substitution workarounds have been removed. See `INFRA-TASKS.md` for details on each fix.

---

## 13. Adapting This Pattern to Your Own Project

To build your own infospace using this pattern:

1. **Choose your source corpus**: any collection of documents you want to transform into structured knowledge.
2. **Define your target ontology**: what concepts, relationships, or categories you want to extract (our VSM is just one example).
3. **Write schemas**: markdown documents defining the required sections and validation rules for each output type.
4. **Write extraction guidelines**: rules that tell the LLM what to look for and how to handle edge cases.
5. **Create prompt templates**: use `@{macro}` syntax to inject source text and guidelines at compile time.
6. **Build your pipeline**: follow `process_chapters.py` as a reference for loading artifacts, resolving templates, and calling the LLM.
7. **Process incrementally**: work through your corpus one document at a time, tracking everything in git.
8. **Measure and refine**: define metrics, assess them periodically, and update your guidelines when gaps appear.

The key architectural insight is that **schemas and guidelines are artifacts**: they live in the same repository as your source text and can be versioned, diffed, and refined just like code.
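
As a starting point for steps 6 and 7, the incremental loop can be sketched without any MarkiTect dependency. This is a simplified stand-in under stated assumptions, not a copy of `process_chapters.py`: the function name `process_corpus`, the `@{source_text}` macro, the `-analysis.md` and `-prompt.md` suffixes for your own outputs, and the `call_llm` callable are all hypothetical, and plain `str.replace` stands in for the real resolver.

```python
import tempfile
from pathlib import Path
from typing import Callable


def process_corpus(
    sources: Path,
    outputs: Path,
    template: str,
    call_llm: Callable[[str], str],
) -> list[str]:
    """Process each source document once; return the names handled in this run."""
    outputs.mkdir(parents=True, exist_ok=True)
    processed = []
    for source in sorted(sources.glob("*.md")):
        target = outputs / f"{source.stem}-analysis.md"
        if target.exists():  # finished in an earlier run: skip it
            continue
        text = source.read_text(encoding="utf-8")
        prompt = template.replace("@{source_text}", text)
        # Keep the compiled prompt next to the output so it can be audited.
        (outputs / f"{source.stem}-prompt.md").write_text(prompt, encoding="utf-8")
        target.write_text(call_llm(prompt), encoding="utf-8")
        processed.append(source.stem)
    return processed


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call, so the sketch runs offline."""
    return f"# Analysis\n\nPrompt length: {len(prompt)}\n"


with tempfile.TemporaryDirectory() as tmp:
    sources = Path(tmp) / "sources"
    sources.mkdir()
    (sources / "doc-a.md").write_text("First document.", encoding="utf-8")
    (sources / "doc-b.md").write_text("Second document.", encoding="utf-8")
    template = "Summarise the following text.\n\n@{source_text}\n"
    first_run = process_corpus(sources, Path(tmp) / "output", template, fake_llm)
    second_run = process_corpus(sources, Path(tmp) / "output", template, fake_llm)
```

Running the loop twice processes both documents the first time and nothing the second time, which is the property that makes it safe to stop and resume. Committing the output directory after each run then gives the per-document history described in section 8.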