# Building an Infospace with History — Tutorial

This tutorial walks through how we built a structured **information space** (infospace) from Adam Smith's *The Wealth of Nations*, mapping classical economic concepts to Stafford Beer's **Viable System Model** (VSM), using MarkiTect's prompt dependency resolution and LLM integration.

By the end you will understand how to:

1. Design schemas that scaffold structured LLM output
2. Write prompt templates with dependency injection (`@{macro}` syntax)
3. Populate source artifacts and reference material
4. Run an incremental, chapter-by-chapter pipeline
5. Track every change through git history
6. Measure completeness and consistency with metrics
7. Continue the work to process remaining chapters

---

## 1. The Idea

We want to transform a large body of text — the full public-domain text of *The Wealth of Nations* (5 books, 35 chapters) — into a **curated collection of economic concepts and entities**, each mapped to the VSM.

The challenge: this is too much for a single prompt. The text is hundreds of thousands of words. We need to work **incrementally**, one chapter at a time, building up the infospace and tracking progress.

MarkiTect's prompt dependency resolution lets us define **templates** with `@{placeholder}` macros that are filled from an artifact repository at execution time. The pipeline compiles each template into a complete prompt, sends it to an LLM, and stores the output — all tracked by git.

---

## 2. Project Layout

```
examples/infospace-with-history/
│
├── README.md                          # Project brief
├── TUTORIAL.md                        # This file
├── INFRA-TASKS.md                     # Infrastructure issues found during the experiment
├── process_chapters.py                # Pipeline script
├── infospace.db                       # SQLite artifact database (generated, not in git)
│
├── schemas/                           # Output structure definitions
│   ├── economic-entity-schema-v1.0.md
│   ├── vsm-concept-schema-v1.0.md
│   ├── vsm-mapping-schema-v1.0.md
│   └── chapter-analysis-schema-v1.0.md
│
├── templates/                         # Prompt templates (with @{macro} placeholders)
│   ├── extract-entities.md
│   ├── map-to-vsm.md
│   ├── synthesize-analysis.md
│   └── assess-metrics.md
│
├── artifacts/                         # Input artifacts
│   ├── sources/                       # Chapter text (35 files)
│   ├── guidelines/                    # Extraction and mapping rules
│   └── vsm-reference/                 # VSM framework definition
│
└── output/                            # Generated artifacts (LLM outputs)
    ├── entities/                      # Flat canonical entity set + chapter views
    │   ├── division-of-labour.md              # Canonical entity file (PRIMARY)
    │   ├── exchange.md
    │   ├── commercial-society.md
    │   ├── ...
    │   ├── book-1-chapter-01-entities.md      # Chapter view (transclusion)
    │   ├── book-1-chapter-01-prompt.md        # Compiled prompt
    │   ├── book-1-chapter-04-entities.md      # Also references division-of-labour.md
    │   └── ...
    ├── mappings/                      # Per-chapter VSM mappings
    ├── analyses/                      # Per-chapter synthesised analyses
    └── metrics/                       # Cross-chapter metrics reports
```

**Entity organisation**: The infospace maintains a **flat canonical set** of entities — one markdown file per entity, stored directly in `output/entities/`. When a chapter mentions an entity that already exists (detected by slug collision), the duplicate is skipped and the original definition is kept. This builds a **minimal necessary and sufficient set** of entities across the entire book.

Per-chapter `*-entities.md` files are **secondary views** that use MarkiTect's transclusion engine (`{{ include "entity.md" }}`) to compose entity content by reference.
The same entity (e.g., `division-of-labour.md`) can appear in multiple chapter views. Editing a canonical entity file automatically updates every chapter view that references it.

**Deduplication**: The pipeline tells the LLM which entities already exist (via the `@{existing_entities}` macro in the extraction template) so it focuses on genuinely new entities. At the file level, slug collisions are detected and skipped as a safety net.

**Entity lifecycle**: Once an entity enters the canonical set, it is **never silently deleted**. Entities may only be retired when they have been subsumed by another entity, found to partially map onto other entities, or otherwise determined to be redundant. Retired entities are **archived** — moved to `output/entities/archive/` with a dated header documenting the reason. This preserves the intellectual history of the infospace: every decision to drop an entity is a deliberate, documented learning.

```bash
# Archive an entity that has been subsumed by another
python process_chapters.py --archive-entity enlarged-monopoly \
    --reason "Subsumed by monopoly-price — both describe the same market distortion"

# The archived file retains its full content with an explanatory header
cat output/entities/archive/enlarged-monopoly.md
```

The `--list` command shows both the active canonical set and the archive count.

---

## 3. Designing Schemas

Before writing any prompts we defined **four schemas** that tell the LLM exactly what sections each output document must contain. This ensures every generated document is machine-parseable and comparable across chapters.

### Economic Entity Schema (`schemas/economic-entity-schema-v1.0.md`)

Every extracted entity must have:

- **H1 heading** with the entity name
- **Definition** (20-150 words)
- **Source Chapter** citing Book and Chapter
- **Context** — where in Smith's argument the entity appears
- **Economic Domain** (Production, Distribution, Exchange, etc.)

Optional: Smith's Original Wording, Modern Interpretation.

### VSM Mapping Schema (`schemas/vsm-mapping-schema-v1.0.md`)

Every entity-to-VSM mapping must have:

- **H1 heading** in the format `Entity Name -> VSM Concept Name`
- **Economic Entity Reference** and **VSM Concept Reference**
- **Mapping Rationale** (minimum 30 words, grounded in Beer's definitions)
- **Mapping Strength**: Strong, Moderate, or Weak

### Chapter Analysis Schema (`schemas/chapter-analysis-schema-v1.0.md`)

The per-chapter synthesis includes:

- **Chapter Summary** (50-300 words)
- **Entities Extracted** — bulleted list
- **VSM Mappings** — entity, concept, strength
- **VSM Coverage** — explicit assessment of S1 through S5 and S3*
- **Gaps & Observations**

### Metrics Schema (implicit in `assess-metrics` template)

The metrics report computes:

- VSM Concept Coverage (% of S1-S5, recursion, variety, etc.)
- Chapter Coverage (% of 35 chapters processed)
- Entity and Mapping counts
- Terminology Consistency and Cross-reference Integrity scores

**Key insight**: Schemas are not code — they are markdown documents that the LLM reads as instructions. This means you can iterate on them without changing any infrastructure.

---

## 4. Writing Prompt Templates

Each template is a markdown file containing instructions for the LLM plus `@{macro_name}` placeholders that MarkiTect's resolver fills with artifact content at compile time.

### Template 1: Extract Entities (`templates/extract-entities.md`)

```markdown
# Extract Economic Entities

You are an analytical economist specialising in classical economic theory.
Your task is to extract distinct economic entities from a chapter of
Adam Smith's *The Wealth of Nations*.

## Source Chapter

@{chapter_text}

## Extraction Guidelines

@{extraction_rules}

## VSM Framework Context

@{vsm_framework}

## Existing Entities

@{existing_entities}

## Instructions

[... detailed step-by-step instructions ...]

## Output Format

Output each entity as a separate markdown document, delimited by
`--- ENTITY: ---` markers.
```

The four macros (`chapter_text`, `extraction_rules`, `vsm_framework`, `existing_entities`) are resolved by looking up artifacts by name in the relevant information spaces. The `existing_entities` list is dynamically generated at runtime from the canonical entity files already on disk, enabling incremental extraction without duplication.

### Template 2: Map to VSM (`templates/map-to-vsm.md`)

Takes `@{entities}` (output from stage 1), `@{vsm_framework}`, and `@{mapping_rules}` as inputs.

### Template 3: Synthesise Analysis (`templates/synthesize-analysis.md`)

Takes `@{chapter_text}`, `@{entities}` (stage 1 output), `@{mappings}` (stage 2 output), and `@{vsm_framework}`.

### Template 4: Assess Metrics (`templates/assess-metrics.md`)

Takes `@{all_analyses}` (concatenation of all chapter analyses) and `@{vsm_framework}`. Runs across the entire infospace, not per-chapter.

**Dependency chain per chapter:**

```
chapter_text ──────┐
extraction_rules ──┤
vsm_framework ─────┤
                   ▼
            extract-entities
                   │
                   ▼ entities
               map-to-vsm
                   │
                   ▼ mappings
          synthesize-analysis
                   │
                   ▼ analysis
```

After all chapters are processed, `assess-metrics` evaluates the complete infospace.

---

## 5. Populating Artifacts

### Source chapters (`artifacts/sources/`)

35 markdown files containing the full public-domain text of each chapter. Named by convention: `book-1-chapter-01.md` through `book-5-chapter-03.md`, plus `introduction.md`.

These are loaded into the `infospace-sources` information space.

### Guidelines (`artifacts/guidelines/`)

Two hand-written reference documents:

- **`extraction-rules.md`** — What constitutes an entity, granularity rules, naming conventions, quality checks.
- **`mapping-rules.md`** — How to map entities to VSM systems, what constitutes strong/moderate/weak mapping strength.

These are loaded into `infospace-guidelines`.
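
To make the connection between these artifact files and the `@{macro}` placeholders in the templates concrete, the following is a minimal Python sketch of name-based substitution. It is not MarkiTect's resolver: `compile_prompt`, `MACRO_PATTERN`, and the in-memory artifact dictionary are hypothetical names chosen for illustration, and the bound strings are stand-ins rather than real artifact content.

```python
import re

# Assumption: macro names look like Python identifiers, e.g. @{chapter_text}.
MACRO_PATTERN = re.compile(r"@\{([A-Za-z_][A-Za-z0-9_]*)\}")


def compile_prompt(template: str, artifacts: dict[str, str]) -> str:
    """Replace every @{name} placeholder with the artifact bound to that name."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in artifacts:
            # Fail loudly instead of silently emitting an empty or placeholder value.
            raise KeyError(f"no artifact bound to macro @{{{name}}}")
        return artifacts[name]

    return MACRO_PATTERN.sub(substitute, template)


# Hypothetical stand-ins for content loaded from artifacts/sources/ and artifacts/guidelines/.
artifacts = {
    "chapter_text": "(text of one chapter)",
    "extraction_rules": "(contents of extraction-rules.md)",
}
template = (
    "## Source Chapter\n@{chapter_text}\n\n"
    "## Extraction Guidelines\n@{extraction_rules}\n"
)
prompt = compile_prompt(template, artifacts)
print(prompt)
```

Raising on an unbound macro is a deliberate choice in this sketch: a prompt compiled with a missing guideline would still run, but the LLM output would be quietly worse and hard to trace back.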
### VSM reference (`artifacts/vsm-reference/`)

- **`vsm-framework.md`** — Complete description of Beer's VSM (S1-S5, S3*, recursion, variety, viability, attenuation/amplification, algedonic signals, autonomy). Includes economic interpretations for each system.

Loaded into `infospace-vsm-reference`.

---

## 6. The Pipeline Script

`process_chapters.py` orchestrates everything. It:

1. Initialises the artifact repository (SQLite) and information spaces
2. Loads all static artifacts (templates, guidelines, VSM reference)
3. For each chapter, runs the three-stage pipeline
4. Optionally calls an LLM to auto-generate outputs
5. Records dependency edges in the graph
6. Commits results to git

### Running a single chapter

```bash
# Manual mode (writes prompts, waits for you to provide output files):
python process_chapters.py --chapter book-1-chapter-05 --no-commit

# Automatic mode via OpenRouter (recommended — fast, real token counts):
python process_chapters.py --chapter book-1-chapter-05 --provider openrouter --no-commit

# Automatic mode via Gemini free tier:
python process_chapters.py --chapter book-1-chapter-05 --provider gemini --no-commit

# Automatic mode via Claude Code CLI:
python process_chapters.py --chapter book-1-chapter-05 --provider claude-code --no-commit

# With a specific model:
python process_chapters.py --chapter book-1-chapter-05 --provider openrouter --model anthropic/claude-haiku-4-5-20251001 --no-commit
python process_chapters.py --chapter book-1-chapter-05 --provider gemini --model gemini-2.5-flash-lite --no-commit
```

### Running a whole book

```bash
python process_chapters.py --book 1 --provider openrouter --no-commit
```

### Running all chapters

```bash
python process_chapters.py --all --provider openrouter --no-commit
```

### Checking progress

```bash
python process_chapters.py --list
```

Prints a table showing which chapters have completed each stage (entity counts reflect the chapter view's transclusion references, including shared entities from earlier chapters):

```
Available chapters (35):

Chapter                        Entities     Mappings     Analysis
------------------------------ ------------ ------------ ------------
book-1-chapter-01              done (13)    done         done
book-1-chapter-02              done (7)     done         done
book-1-chapter-03              done (18)    done         done
book-1-chapter-04              done (5)     done         done
book-1-chapter-05              done (1)     done         done
...

Canonical entity set: 42 unique entities
```

### Assessing metrics

After processing a batch of chapters, run the metrics assessment:

```bash
python process_chapters.py --metrics --provider openrouter --no-commit
```

This concatenates all completed analyses and asks the LLM to evaluate coverage, consistency, and completeness.

### Dependency statistics

```bash
python process_chapters.py --stats
```

---

## 7. The Artifact Database (`infospace.db`)

The pipeline stores all artifacts (source text, templates, guidelines, generated outputs) and their dependency edges in a local SQLite database — `infospace.db`. This file is **not checked into git** because it is a derived cache that can be regenerated deterministically from the files already in the repository.

### Why it is excluded

- **Binary format** — SQLite databases don't produce meaningful diffs and would bloat the git history with every pipeline run.
- **Fully derived** — every piece of data in the database originates from markdown files that *are* tracked in git (sources, templates, schemas, guidelines, and generated output).
- **Reproducible** — re-running the pipeline rebuilds the database from scratch without any LLM calls, because each stage checks for existing output files on disk before invoking the LLM.

### How to regenerate it

If `infospace.db` is missing (e.g. after a fresh clone), rebuild it by re-running the pipeline over the chapters that already have output on disk:

```bash
# Regenerate the database from existing output files (no LLM calls needed):
python process_chapters.py --all --no-commit
```

This will:

1. Create a fresh `infospace.db`
2. Load all static artifacts (templates, guidelines, VSM reference)
3. For each chapter whose output files already exist, import them into the database and record dependency edges
4. Skip LLM calls entirely — existing files are detected and reused

After regeneration, `--list` and `--stats` work as normal:

```bash
python process_chapters.py --list
python process_chapters.py --stats
```

---

## 8. How the LLM Integration Works

The pipeline uses MarkiTect's `markitect.llm` module, which provides three adapter backends that implement the `LLMAdapter` interface:

| Backend | How it works | Pros | Cons |
|---------|-------------|------|------|
| `openrouter` | HTTP POST to OpenRouter API | Fast, real token counts, any model | Needs API key |
| `claude-code` | Shells out to `claude --print` | No API key needed if CLI installed | Slower, estimated token counts |
| `gemini` | HTTP POST to Google Generative Language API | Free tier available, real token counts | Rate limits on free tier |

### API key setup (OpenRouter)

Place your key in one of these locations (checked in order):

1. Pass `--api-key` on the command line (not yet implemented in the CLI)
2. Set `OPENROUTER_API_KEY` environment variable
3. Create `apikey-openrouter.txt` in the project root (git-ignored)

### API key setup (Gemini)

Place your Google AI Studio key in one of these locations (checked in order):

1. Set `GEMINI_API_KEY` environment variable
2. Create `apikey-geminifree.txt` in the project root (git-ignored)

Get a free API key at https://aistudio.google.com/apikey. The free tier supports `gemini-2.5-flash` with generous rate limits.

### What happens per stage

1. The pipeline **resolves** macro placeholders by looking up artifacts in the repository
2. It **compiles** the template into a complete prompt (macros replaced with real content)
3. It writes the compiled prompt to `output//-prompt.md` for inspection
4. If no output exists yet and an LLM adapter is configured, it **executes** the prompt
5. **For entity extraction (stage 1):** the pipeline first binds the list of already-existing entity slugs to `@{existing_entities}` so the LLM knows what to skip. The LLM returns combined content with `--- ENTITY: ---` delimiters. The pipeline splits this into the **flat canonical directory** (`output/entities/.md`), skipping any entity whose slug already exists. It then generates the chapter view file with transclusion directives. The combined content is never persisted as a single file — canonical entity files are the source of truth.
6. **For other stages:** the result is written directly to its output file
7. The output is **stored** as a generated artifact in the repository
8. Dependency edges are **recorded** in the graph

---

## 9. Tracking History with Git

Every processed chapter produces a git commit containing:

- Compiled prompts (`*-prompt.md`) — so you can audit exactly what was sent
- Canonical entity files (`output/entities/.md`) — one file per entity, shared across chapters, first occurrence wins
- Chapter entity views (`-entities.md`) — transclusion into the canonical entities relevant to each chapter
- Generated outputs (`*-mappings.md`, `*-analysis.md`)

This means:

- `git log` shows the chronological order of processing
- `git diff` between commits shows what each chapter contributed
- You can `git bisect` to find where quality degraded
- You can revert a chapter and re-process it with different settings

To let the script auto-commit (default):

```bash
python process_chapters.py --chapter book-1-chapter-05 --provider openrouter
```

To commit manually after reviewing:

```bash
python process_chapters.py --chapter book-1-chapter-05 --provider openrouter --no-commit
# review new entity files in output/entities/ (look for recently modified .md files)
# review chapter view in output/entities/book-1-chapter-05-entities.md
git add examples/infospace-with-history/output/
git commit -m "infospace: process book-1-chapter-05"
```

---

## 10. Cost and Performance

From our measurements processing chapters 3-5:

| | Claude Code CLI | OpenRouter | Gemini (free) |
|---|---|---|---|
| Time per chapter | ~5 minutes | ~2 minutes | ~45 seconds |
| Token counts | Estimated (4 chars/tok) | Real (from API) | Real (from API) |
| Cost per chapter | ~$0.35 est. | ~$0.07 est. | Free |
| Default model | claude-sonnet-4 | anthropic/claude-sonnet-4 | gemini-2.5-flash |

**Projected cost for all 35 chapters via OpenRouter:** ~$2.50 (varies by chapter length; Book V chapters are longer).

**Gemini free tier** is a zero-cost option for experimentation and development. The `gemini-2.5-flash` model produces good results. Rate limits apply — the pipeline retries automatically on 429 responses.

To reduce costs further, use a cheaper model:

```bash
--provider openrouter --model anthropic/claude-haiku-4-5-20251001
--provider gemini --model gemini-2.5-flash-lite
```

---

## 11. Completing the Remaining Chapters

As of now, 5 of 35 chapters are processed (Book I, Chapters 1-5). Here is how to complete the rest.

### Step-by-step

**1. Process remaining Book I chapters (6-11):**

```bash
python process_chapters.py --book 1 --provider openrouter --no-commit
```

Already-processed chapters are skipped (their chapter view files exist). Entities from earlier chapters are automatically shared — the LLM is told which entities already exist and avoids re-extracting them.

**2. Process Books II-V:**

```bash
python process_chapters.py --book 2 --provider openrouter --no-commit
python process_chapters.py --book 3 --provider openrouter --no-commit
python process_chapters.py --book 4 --provider openrouter --no-commit
python process_chapters.py --book 5 --provider openrouter --no-commit
```

Or all at once:

```bash
python process_chapters.py --all --provider openrouter --no-commit
```

**3. Run metrics after each book (or at the end):**

```bash
python process_chapters.py --metrics --provider openrouter --no-commit
```

**4. Commit the results:**

```bash
git add examples/infospace-with-history/output/
git commit -m "infospace: process all remaining chapters"
```

**5. Review the metrics report:**

Open `output/metrics/metrics-report.md`. It will show:

- Which VSM concepts (S1-S5, recursion, variety, etc.) now have mappings
- Total entity and mapping counts
- Consistency scores
- Recommendations for gaps

### Expected progression

| After | Chapters | Expected coverage |
|-------|----------|-------------------|
| Book I (11 ch.) | 11/35 | S1, S2, S4 strong; S3 emerging |
| Books I-II (16 ch.) | 16/35 | S3 (capital control) covered |
| Books I-III (20 ch.) | 20/35 | Historical patterns add depth |
| Books I-IV (30 ch.) | 30/35 | S5 (policy, mercantilism) emerging |
| All (35 ch.) | 35/35 | Full coverage, S3* and algedonic signals likely from Book V |

Book V (public revenue, taxation, sovereign duties) is expected to fill the remaining gaps in S3*, S5, and regulatory concepts.

---

## 12. Quality Improvement Loop

The infospace is designed to be **iteratively refined**:

1. **Process chapters** — run the pipeline
2. **Assess metrics** — identify gaps in VSM coverage and consistency
3. **Refine guidelines** — update `extraction-rules.md` or `mapping-rules.md` to address identified weaknesses
4. **Re-process** — delete output files for specific chapters and re-run with improved guidelines
5. **Compare** — use git diff to see how the refined guidelines changed the output

Example: if metrics show that S3* (Audit) is consistently missed, you could add a paragraph to `extraction-rules.md` explicitly asking the LLM to look for audit, inspection, and oversight mechanisms.

To re-process a specific chapter, remove its chapter view and downstream outputs. Note: canonical entity files in `output/entities/` are shared across chapters — only delete individual entity files if you want them re-extracted from scratch.
```bash
rm -f examples/infospace-with-history/output/entities/book-1-chapter-03-entities.md
rm -f examples/infospace-with-history/output/mappings/book-1-chapter-03-mappings.md
rm -f examples/infospace-with-history/output/analyses/book-1-chapter-03-analysis.md
python process_chapters.py --chapter book-1-chapter-03 --provider openrouter --no-commit
```

**Important**: never silently delete canonical entity files. If an entity is no longer needed, **archive** it with a documented reason:

```bash
# Entity found to be redundant — archive it
python process_chapters.py --archive-entity extent-of-the-market \
    --reason "Subsumed by market-price and effectual-demand — the concept is fully covered by these two entities"

# Then re-process the chapter
python process_chapters.py --chapter book-1-chapter-03 --provider openrouter --no-commit
```

If you genuinely need to re-extract an entity with different content (e.g., improving its definition), archive the old version first, then delete the archive copy only after confirming the new version is better. The archive in `output/entities/archive/` preserves the full intellectual history of the infospace — every refinement decision is traceable.

---

## 13. Infrastructure Issues Found and Fixed

During development we documented three issues with the MarkiTect infrastructure in `INFRA-TASKS.md`:

1. **Artifact repo doesn't store content** — the resolver returned placeholder text instead of actual artifact content.
2. **ContentMacro `raw_text` defaults to `""`** — caused silent data corruption when macros were constructed programmatically.
3. **No `@{target}` syntax in MacroParser** — macros had to be constructed manually rather than auto-detected from template text.

All three have been fixed in the markitect infrastructure. The pipeline script (`process_chapters.py`) has been refactored to use the fixed infrastructure directly — the local content cache, manual macro construction, and manual substitution workarounds have been removed.
See `INFRA-TASKS.md` for details on each fix.

---

## 14. Adapting This Pattern to Your Own Project

To build your own infospace using this pattern:

1. **Choose your source corpus** — any collection of documents you want to transform into structured knowledge.
2. **Define your target ontology** — what concepts, relationships, or categories you want to extract (our VSM is just one example).
3. **Write schemas** — markdown documents defining the required sections and validation rules for each output type.
4. **Write extraction guidelines** — rules that tell the LLM what to look for and how to handle edge cases.
5. **Create prompt templates** — use `@{macro}` syntax to inject source text and guidelines at compile time.
6. **Build your pipeline** — follow `process_chapters.py` as a reference for loading artifacts, resolving templates, and calling the LLM.
7. **Process incrementally** — work through your corpus one document at a time, tracking everything in git.
8. **Measure and refine** — define metrics, assess them periodically, and update your guidelines when gaps appear.

The key architectural insight is that **schemas and guidelines are artifacts** — they live in the same repository as your source text and can be versioned, diffed, and refined just like code.
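
For step 7, the first-occurrence-wins rule that this tutorial applies to its canonical entity set can be sketched in a few lines of Python. This is an illustrative sketch, not the code in `process_chapters.py`: `slugify`, `add_entities`, and the slug rule itself are assumptions made for the example. Only the `{{ include "..." }}` directive form and the `-entities.md` suffix for chapter views are taken from the tutorial.

```python
import re
import tempfile
from pathlib import Path


def slugify(name: str) -> str:
    """Hypothetical slug rule: lowercase, runs of non-alphanumerics become one hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def add_entities(entity_dir: Path, chapter: str, entities: dict[str, str]) -> list[str]:
    """Write new canonical files, skip slug collisions, and build the chapter view."""
    entity_dir.mkdir(parents=True, exist_ok=True)
    written = []
    view_lines = []
    for name, body in entities.items():
        slug = slugify(name)
        target = entity_dir / f"{slug}.md"
        if not target.exists():  # first occurrence wins; later duplicates are skipped
            target.write_text(body, encoding="utf-8")
            written.append(slug)
        # The chapter view references the canonical file whether or not it is new.
        view_lines.append(f'{{{{ include "{slug}.md" }}}}')
    view_path = entity_dir / f"{chapter}-entities.md"
    view_path.write_text("\n".join(view_lines) + "\n", encoding="utf-8")
    return written


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "entities"
    first = add_entities(out, "book-1-chapter-01",
                         {"Division of Labour": "# Division of Labour\n"})
    second = add_entities(out, "book-1-chapter-04",
                          {"Division of Labour": "# Duplicate\n", "Money": "# Money\n"})
    kept = (out / "division-of-labour.md").read_text(encoding="utf-8")
    view = (out / "book-1-chapter-04-entities.md").read_text(encoding="utf-8")

print(first, second)
print(view)
```

Keeping the chapter view as a list of references rather than copied text is what lets one edit to a canonical file show up in every chapter that mentions the entity.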