markitect-main/examples/infospace-with-history/TUTORIAL.md
tegwick 2d1282a61e feat(infospace): flat canonical entity set with cross-chapter deduplication
Restructure entity storage from per-chapter subdirectories to a flat
canonical set in output/entities/. Each entity exists as a single file;
duplicates across chapters are detected by slug collision and skipped
(first occurrence wins). Chapter views use {{ include }} transclusion
to reference shared entity files.

Add @{existing_entities} macro to extract-entities template so the LLM
knows which entities already exist and focuses on genuinely new ones.
Refactor _call_llm() from _execute_llm() for callers that handle their
own file I/O. 41 unique entities from 4 chapters (2 duplicates removed).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11 22:24:20 +01:00

# Building an Infospace with History — Tutorial
This tutorial walks through how we built a structured **information space**
(infospace) from Adam Smith's *The Wealth of Nations*, mapping classical
economic concepts to Stafford Beer's **Viable System Model** (VSM), using
MarkiTect's prompt dependency resolution and LLM integration.
By the end you will understand how to:
1. Design schemas that scaffold structured LLM output
2. Write prompt templates with dependency injection (`@{macro}` syntax)
3. Populate source artifacts and reference material
4. Run an incremental, chapter-by-chapter pipeline
5. Track every change through git history
6. Measure completeness and consistency with metrics
7. Continue the work to process remaining chapters
---
## 1. The Idea
We want to transform a large body of text, the full public-domain text of
*The Wealth of Nations* (5 books, 35 chapters), into a **curated
collection of economic concepts and entities**, each mapped to the VSM.

The challenge: this is too much for a single prompt. The text runs to
hundreds of thousands of words, so we need to work **incrementally**, one
chapter at a time, building up the infospace and tracking progress.

MarkiTect's prompt dependency resolution lets us define **templates** with
`@{placeholder}` macros that are filled from an artifact repository at
execution time. The pipeline compiles each template into a complete prompt,
sends it to an LLM, and stores the output, with every step tracked by git.

---
## 2. Project Layout
```
examples/infospace-with-history/
├── README.md                  # Project brief
├── TUTORIAL.md                # This file
├── INFRA-TASKS.md             # Infrastructure issues found during the experiment
├── process_chapters.py        # Pipeline script
├── schemas/                   # Output structure definitions
│   ├── economic-entity-schema-v1.0.md
│   ├── vsm-concept-schema-v1.0.md
│   ├── vsm-mapping-schema-v1.0.md
│   └── chapter-analysis-schema-v1.0.md
├── templates/                 # Prompt templates (with @{macro} placeholders)
│   ├── extract-entities.md
│   ├── map-to-vsm.md
│   ├── synthesize-analysis.md
│   └── assess-metrics.md
├── artifacts/                 # Input artifacts
│   ├── sources/               # Chapter text (35 files)
│   ├── guidelines/            # Extraction and mapping rules
│   └── vsm-reference/         # VSM framework definition
└── output/                    # Generated artifacts (LLM outputs)
    ├── entities/              # Flat canonical entity set + chapter views
    │   ├── division-of-labour.md             # Canonical entity file (PRIMARY)
    │   ├── exchange.md
    │   ├── commercial-society.md
    │   ├── ...
    │   ├── book-1-chapter-01-entities.md     # Chapter view (transclusion)
    │   ├── book-1-chapter-01-prompt.md       # Compiled prompt
    │   ├── book-1-chapter-04-entities.md     # Also references division-of-labour.md
    │   └── ...
    ├── mappings/              # Per-chapter VSM mappings
    ├── analyses/              # Per-chapter synthesised analyses
    └── metrics/               # Cross-chapter metrics reports
```
**Entity organisation**: The infospace maintains a **flat canonical set**
of entities — one markdown file per entity, stored directly in
`output/entities/`. When a chapter mentions an entity that already exists
(detected by slug collision), the duplicate is skipped and the original
definition is kept. This builds a **minimal necessary and sufficient set**
of entities across the entire book.
Per-chapter `*-entities.md` files are **secondary views** that use
MarkiTect's transclusion engine (`{{ include "entity.md" }}`) to compose
entity content by reference. The same entity (e.g., `division-of-labour.md`)
can appear in multiple chapter views. Editing a canonical entity file
automatically updates every chapter view that references it.
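
To make the chapter-view idea concrete, here is a minimal sketch of how such a view file could be generated. The helper name `render_chapter_view` and the heading layout are assumptions made for illustration; only the `{{ include "..." }}` directive form is taken from the text above, and the real pipeline may lay out its views differently.

```python
def render_chapter_view(chapter: str, slugs: list[str]) -> str:
    """Build a chapter view that references canonical entity files.

    Hypothetical helper: the heading text and spacing are assumptions.
    """
    lines = [f"# Entities: {chapter}", ""]
    for slug in slugs:
        # One transclusion directive per entity; the content stays in <slug>.md.
        lines.append(f'{{{{ include "{slug}.md" }}}}')
        lines.append("")
    return "\n".join(lines)


# Illustrative slugs taken from the project layout above.
view = render_chapter_view("book-1-chapter-04", ["division-of-labour", "exchange"])
print(view)
```

Because the view holds references rather than copies, there is nothing to re-generate when a canonical file is edited.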
**Deduplication**: The pipeline tells the LLM which entities already exist
(via the `@{existing_entities}` macro in the extraction template) so it
focuses on genuinely new entities. At the file level, slug collisions
are detected and skipped as a safety net.
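
The file-level safety net can be sketched as follows. This is an assumption about how slug-based deduplication might work, not the code in `process_chapters.py`: the `slugify` rule (lowercase, alphanumeric runs joined by hyphens) and the `store_entity` helper are both hypothetical.

```python
import re
from pathlib import Path


def slugify(name: str) -> str:
    """Lowercase the name and join its alphanumeric runs with hyphens (assumed rule)."""
    return "-".join(re.findall(r"[a-z0-9]+", name.lower()))


def store_entity(entities_dir: Path, name: str, body: str) -> bool:
    """Write <slug>.md unless it already exists. Returns True if a file was written."""
    path = entities_dir / f"{slugify(name)}.md"
    if path.exists():
        # Slug collision: keep the first occurrence and skip the duplicate.
        return False
    path.write_text(body, encoding="utf-8")
    return True
```

Returning a boolean lets the caller count how many entities were new and how many were skipped as duplicates.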
---
## 3. Designing Schemas
Before writing any prompts we defined **four schemas** that tell the LLM
exactly what sections each output document must contain. This ensures every
generated document is machine-parseable and comparable across chapters.
### Economic Entity Schema (`schemas/economic-entity-schema-v1.0.md`)
Every extracted entity must have:
- **H1 heading** with the entity name
- **Definition** (20-150 words)
- **Source Chapter** citing Book and Chapter
- **Context** — where in Smith's argument the entity appears
- **Economic Domain** (Production, Distribution, Exchange, etc.)
Optional: Smith's Original Wording, Modern Interpretation.
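
Because the schema is this regular, a generated entity file can be checked mechanically. The sketch below assumes that each required section appears as an H2 heading with exactly the title listed above; that layout, and the `check_entity` helper itself, are assumptions rather than part of MarkiTect.

```python
import re

REQUIRED_SECTIONS = ["Definition", "Source Chapter", "Context", "Economic Domain"]


def check_entity(markdown: str) -> list[str]:
    """Return a list of schema problems; an empty list means the document passes."""
    problems = []
    if not re.search(r"^# \S", markdown, flags=re.MULTILINE):
        problems.append("missing H1 heading")
    # Split on H2 headings: [preamble, title1, body1, title2, body2, ...]
    parts = re.split(r"^## (.+)$", markdown, flags=re.MULTILINE)
    sections = {t.strip(): b.strip() for t, b in zip(parts[1::2], parts[2::2])}
    for title in REQUIRED_SECTIONS:
        if title not in sections:
            problems.append(f"missing section: {title}")
    if "Definition" in sections:
        words = len(sections["Definition"].split())
        if not 20 <= words <= 150:
            problems.append(f"definition has {words} words, expected 20-150")
    return problems
```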
### VSM Mapping Schema (`schemas/vsm-mapping-schema-v1.0.md`)
Every entity-to-VSM mapping must have:
- **H1 heading** in the format `Entity Name -> VSM Concept Name`
- **Economic Entity Reference** and **VSM Concept Reference**
- **Mapping Rationale** (minimum 30 words, grounded in Beer's definitions)
- **Mapping Strength**: Strong, Moderate, or Weak
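
The heading format and the strength vocabulary are simple enough to verify in code. The following is a small sketch under the assumption that the H1 uses the literal ` -> ` separator shown above; the helper names are hypothetical.

```python
import re

VALID_STRENGTHS = {"Strong", "Moderate", "Weak"}
HEADING = re.compile(r"^# (.+?) -> (.+)$", re.MULTILINE)


def parse_mapping_heading(markdown: str) -> tuple[str, str]:
    """Return (entity name, VSM concept name) from the H1 heading."""
    match = HEADING.search(markdown)
    if match is None:
        raise ValueError("expected an H1 like '# Entity Name -> VSM Concept Name'")
    return match.group(1).strip(), match.group(2).strip()


def rationale_is_long_enough(rationale: str, minimum_words: int = 30) -> bool:
    """Check the schema's minimum length for the Mapping Rationale section."""
    return len(rationale.split()) >= minimum_words
```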
### Chapter Analysis Schema (`schemas/chapter-analysis-schema-v1.0.md`)
The per-chapter synthesis includes:
- **Chapter Summary** (50-300 words)
- **Entities Extracted** — bulleted list
- **VSM Mappings** — entity, concept, strength
- **VSM Coverage** — explicit assessment of S1 through S5 and S3*
- **Gaps & Observations**
### Metrics Schema (implicit in `assess-metrics` template)
The metrics report computes:
- VSM Concept Coverage (% of S1-S5, recursion, variety, etc.)
- Chapter Coverage (% of 35 chapters processed)
- Entity and Mapping counts
- Terminology Consistency and Cross-reference Integrity scores

**Key insight**: Schemas are not code; they are markdown documents that
the LLM reads as instructions. This means you can iterate on them without
changing any infrastructure.

---
## 4. Writing Prompt Templates
Each template is a markdown file containing instructions for the LLM plus
`@{macro_name}` placeholders that MarkiTect's resolver fills with artifact
content at compile time.
### Template 1: Extract Entities (`templates/extract-entities.md`)
```markdown
# Extract Economic Entities
You are an analytical economist specialising in classical economic theory.
Your task is to extract distinct economic entities from a chapter of
Adam Smith's *The Wealth of Nations*.
## Source Chapter
@{chapter_text}
## Extraction Guidelines
@{extraction_rules}
## VSM Framework Context
@{vsm_framework}
## Existing Entities
@{existing_entities}
## Instructions
[... detailed step-by-step instructions ...]
## Output Format
Output each entity as a separate markdown document, delimited by
`--- ENTITY: <entity-name> ---` markers.
```
The four macros (`chapter_text`, `extraction_rules`, `vsm_framework`,
`existing_entities`) are resolved by looking up artifacts by name in
the relevant information spaces. The `existing_entities` list is
dynamically generated at runtime from the canonical entity files
already on disk, enabling incremental extraction without duplication.
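
As a rough illustration of what "compiling" a template means, the sketch below substitutes `@{name}` placeholders from a plain dictionary. It is not MarkiTect's resolver: `compile_template` and the dictionary standing in for the artifact repository are assumptions made for this example.

```python
import re

MACRO = re.compile(r"@\{(\w+)\}")


def compile_template(template: str, artifacts: dict[str, str]) -> str:
    """Replace each @{name} with artifacts[name]; fail loudly on unknown names."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in artifacts:
            raise KeyError(f"no artifact bound to macro @{{{name}}}")
        return artifacts[name]

    # A function replacement avoids backslash handling in the artifact text.
    return MACRO.sub(substitute, template)


prompt = compile_template(
    "## Existing Entities\n@{existing_entities}\n",
    {"existing_entities": "- division-of-labour\n- exchange"},
)
print(prompt)
```

Raising on an unbound macro, rather than leaving the placeholder in place, keeps an incomplete prompt from being sent to the LLM unnoticed.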
### Template 2: Map to VSM (`templates/map-to-vsm.md`)
Takes `@{entities}` (output from stage 1), `@{vsm_framework}`, and
`@{mapping_rules}` as inputs.
### Template 3: Synthesise Analysis (`templates/synthesize-analysis.md`)
Takes `@{chapter_text}`, `@{entities}` (stage 1 output),
`@{mappings}` (stage 2 output), and `@{vsm_framework}`.
### Template 4: Assess Metrics (`templates/assess-metrics.md`)
Takes `@{all_analyses}` (concatenation of all chapter analyses) and
`@{vsm_framework}`. Runs across the entire infospace, not per-chapter.
**Dependency chain per chapter:**
```
chapter_text ───────┐
extraction_rules ───┤
vsm_framework ──────┤
existing_entities ──┤
                    ▼
            extract-entities
                    │
                    ▼ entities
               map-to-vsm
                    │
                    ▼ mappings
           synthesize-analysis
                    │
                    ▼ analysis
```
After all chapters are processed, `assess-metrics` evaluates the
complete infospace.
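
The per-chapter ordering follows directly from the inputs each template declares. As a sketch, the standard-library `graphlib` module can derive that order from a dependency table; the table below is transcribed from the template descriptions in this section, and is an illustration rather than how MarkiTect stores its dependency graph.

```python
from graphlib import TopologicalSorter

# Each stage maps to the inputs it needs (static artifacts or earlier stages).
DEPENDENCIES = {
    "extract-entities": {
        "chapter_text", "extraction_rules", "vsm_framework", "existing_entities",
    },
    "map-to-vsm": {"extract-entities", "vsm_framework", "mapping_rules"},
    "synthesize-analysis": {
        "chapter_text", "extract-entities", "map-to-vsm", "vsm_framework",
    },
}

# static_order() yields every node after all of its dependencies.
order = list(TopologicalSorter(DEPENDENCIES).static_order())
stages = [name for name in order if name in DEPENDENCIES]
print(stages)
```

`graphlib` requires Python 3.9 or later.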
---
## 5. Populating Artifacts
### Source chapters (`artifacts/sources/`)
35 markdown files containing the full public-domain text of each chapter.
Named by convention: `book-1-chapter-01.md` through `book-5-chapter-03.md`,
plus `introduction.md`.
These are loaded into the `infospace-sources` information space.
### Guidelines (`artifacts/guidelines/`)
Two hand-written reference documents:

- **`extraction-rules.md`**: what constitutes an entity, granularity rules,
  naming conventions, quality checks.
- **`mapping-rules.md`**: how to map entities to VSM systems, and what
  constitutes strong/moderate/weak mapping strength.

These are loaded into `infospace-guidelines`.

### VSM reference (`artifacts/vsm-reference/`)

- **`vsm-framework.md`**: complete description of Beer's VSM (S1-S5, S3*,
  recursion, variety, viability, attenuation/amplification, algedonic
  signals, autonomy). Includes economic interpretations for each system.

Loaded into `infospace-vsm-reference`.

---
## 6. The Pipeline Script
`process_chapters.py` orchestrates everything. It:
1. Initialises the artifact repository (SQLite) and information spaces
2. Loads all static artifacts (templates, guidelines, VSM reference)
3. For each chapter, runs the three-stage pipeline
4. Optionally calls an LLM to auto-generate outputs
5. Records dependency edges in the graph
6. Commits results to git
### Running a single chapter
```bash
# Manual mode (writes prompts, waits for you to provide output files):
python process_chapters.py --chapter book-1-chapter-05 --no-commit
# Automatic mode via OpenRouter (recommended — fast, real token counts):
python process_chapters.py --chapter book-1-chapter-05 --provider openrouter --no-commit
# Automatic mode via Claude Code CLI:
python process_chapters.py --chapter book-1-chapter-05 --provider claude-code --no-commit
# With a specific model:
python process_chapters.py --chapter book-1-chapter-05 --provider openrouter --model anthropic/claude-haiku-4-5-20251001 --no-commit
```
### Running a whole book
```bash
python process_chapters.py --book 1 --provider openrouter --no-commit
```
### Running all chapters
```bash
python process_chapters.py --all --provider openrouter --no-commit
```
### Checking progress
```bash
python process_chapters.py --list
```
Prints a table showing which chapters have completed each stage
(entity counts reflect the chapter view's transclusion references,
including shared entities from earlier chapters):
```
Available chapters (35):
Chapter                        Entities     Mappings     Analysis
------------------------------ ------------ ------------ ------------
book-1-chapter-01              done (13)    done         done
book-1-chapter-02              done (7)     done         done
book-1-chapter-03              done (18)    done         done
book-1-chapter-04              done (5)     done         done
book-1-chapter-05              -            -            -
...
Canonical entity set: 41 unique entities
```
### Assessing metrics
After processing a batch of chapters, run the metrics assessment:
```bash
python process_chapters.py --metrics --provider openrouter --no-commit
```
This concatenates all completed analyses and asks the LLM to evaluate
coverage, consistency, and completeness.
### Dependency statistics
```bash
python process_chapters.py --stats
```
---
## 7. How the LLM Integration Works
The pipeline uses MarkiTect's `markitect.llm` module, which provides two
adapter backends that implement the `LLMAdapter` interface:
| Backend | How it works | Pros | Cons |
|---------|-------------|------|------|
| `openrouter` | HTTP POST to OpenRouter API | Fast, real token counts, any model | Needs API key |
| `claude-code` | Shells out to `claude --print` | No API key needed if CLI installed | Slower, estimated token counts |
### API key setup (OpenRouter)
The key is looked up in the following places, in order:
1. The `--api-key` command-line option (planned; not yet implemented in the CLI)
2. Set `OPENROUTER_API_KEY` environment variable
3. Create `apikey-openrouter.txt` in the project root (git-ignored)
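
The two lookup steps that work today can be sketched as below. The function name `find_openrouter_key` is hypothetical, and the real adapter may differ in details such as whitespace handling; only the environment variable name and the file name come from the list above.

```python
import os
from pathlib import Path
from typing import Optional


def find_openrouter_key(project_root: Path) -> Optional[str]:
    """Check the environment variable first, then the key file in the project root."""
    key = os.environ.get("OPENROUTER_API_KEY", "").strip()
    if key:
        return key
    key_file = project_root / "apikey-openrouter.txt"
    if key_file.is_file():
        # An empty file counts as "no key".
        return key_file.read_text(encoding="utf-8").strip() or None
    return None
```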
### What happens per stage
1. The pipeline **resolves** macro placeholders by looking up artifacts
in the repository
2. It **compiles** the template into a complete prompt (macros replaced
with real content)
3. It writes the compiled prompt to `output/<stage>/<chapter>-prompt.md`
for inspection
4. If no output exists yet and an LLM adapter is configured, it
**executes** the prompt
5. **For entity extraction (stage 1):** the pipeline first binds the
list of already-existing entity slugs to `@{existing_entities}` so
the LLM knows what to skip. The LLM returns combined content with
`--- ENTITY: <name> ---` delimiters. The pipeline splits this into
the **flat canonical directory** (`output/entities/<slug>.md`),
skipping any entity whose slug already exists. It then generates the
chapter view file with transclusion directives. The combined content
is never persisted as a single file — canonical entity files are the
source of truth.
6. **For other stages:** the result is written directly to its output file
7. The output is **stored** as a generated artifact in the repository
8. Dependency edges are **recorded** in the graph
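
The splitting step in stage 1 can be illustrated with a short sketch. It assumes each delimiter sits on its own line in exactly the `--- ENTITY: <name> ---` form; `split_entities` is a hypothetical helper, not the function used in `process_chapters.py`.

```python
import re

DELIMITER = re.compile(r"^--- ENTITY: (.+?) ---[ \t]*$", re.MULTILINE)


def split_entities(llm_output: str) -> dict[str, str]:
    """Map each entity name to its markdown body, in order of appearance."""
    # With one capture group, split() returns [preamble, name1, body1, name2, body2, ...]
    parts = DELIMITER.split(llm_output)
    entities: dict[str, str] = {}
    for name, body in zip(parts[1::2], parts[2::2]):
        # setdefault keeps the first body if the same name appears twice.
        entities.setdefault(name.strip(), body.strip())
    return entities
```

Each name returned here would then be turned into a slug and written to `output/entities/<slug>.md` only if that file does not already exist.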
---
## 8. Tracking History with Git
Every processed chapter produces a git commit containing:
- Compiled prompts (`*-prompt.md`), so you can audit exactly what was sent
- Canonical entity files (`output/entities/<slug>.md`): one file per entity,
  shared across chapters, first occurrence wins
- Chapter entity views (`<chapter>-entities.md`): transclusion references to
  the canonical entities relevant to each chapter
- Generated outputs (`*-mappings.md`, `*-analysis.md`)

This means:
- `git log` shows the chronological order of processing
- `git diff` between commits shows what each chapter contributed
- You can `git bisect` to find where quality degraded
- You can revert a chapter and re-process it with different settings
To let the script auto-commit (default):
```bash
python process_chapters.py --chapter book-1-chapter-05 --provider openrouter
```
To commit manually after reviewing:
```bash
python process_chapters.py --chapter book-1-chapter-05 --provider openrouter --no-commit
# review new entity files in output/entities/ (look for recently modified .md files)
# review chapter view in output/entities/book-1-chapter-05-entities.md
git add examples/infospace-with-history/output/
git commit -m "infospace: process book-1-chapter-05"
```
---
## 9. Cost and Performance
Measured while processing Book I, Chapters 3 and 4:

| Metric | Claude Code CLI | OpenRouter |
|---|---|---|
| Time per chapter | ~5 minutes | ~2 minutes |
| Token counts | Estimated (~4 characters per token) | Real (from API) |
| Cost per chapter | ~$0.35 est. | ~$0.07 est. |

**Projected cost for all 35 chapters via OpenRouter:** ~$2.50
(varies by chapter length; Book V chapters are longer).
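
The projection is simple multiplication: 35 chapters at roughly $0.07 each comes to about $2.45, in line with the ~$2.50 figure. The sketch below reproduces that arithmetic and the 4-characters-per-token heuristic; the per-chapter figures are the estimates from the table, and the helper names are made up for this example.

```python
CHAPTERS = 35
# USD per chapter, taken from the estimates in the table above.
COST_PER_CHAPTER = {"openrouter": 0.07, "claude-code": 0.35}


def projected_cost(provider: str, chapters: int = CHAPTERS) -> float:
    """Flat projection: per-chapter estimate times the number of chapters."""
    return round(chapters * COST_PER_CHAPTER[provider], 2)


def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Rough token estimate of the kind used when no real counts are available."""
    return len(text) // chars_per_token


for provider in COST_PER_CHAPTER:
    print(provider, projected_cost(provider))
```

A flat projection ignores chapter length, so treat the result as a lower-confidence estimate for the longer Book V chapters.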
To reduce costs further, use a cheaper model:
```bash
python process_chapters.py --chapter book-1-chapter-05 --provider openrouter --model anthropic/claude-haiku-4-5-20251001 --no-commit
```
---
## 10. Completing the Remaining Chapters
As of now, 4 of 35 chapters are processed (Book I, Chapters 1-4). Here is
how to complete the rest.
### Step-by-step
**1. Process remaining Book I chapters (5-11):**
```bash
python process_chapters.py --book 1 --provider openrouter --no-commit
```
Already-processed chapters are skipped (their chapter view files exist).
Entities from earlier chapters are automatically shared — the LLM is
told which entities already exist and avoids re-extracting them.
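
The skip rule amounts to checking whether a chapter's view file already exists. Here is a minimal sketch of that check, assuming the file naming shown in the project layout; `pending_chapters` is a hypothetical helper and not part of `process_chapters.py`.

```python
from pathlib import Path


def pending_chapters(sources_dir: Path, entities_dir: Path) -> list[str]:
    """List chapters that have source text but no chapter view yet."""
    pending = []
    # Only book-N-chapter-NN files are considered in this sketch.
    for source in sorted(sources_dir.glob("book-*-chapter-*.md")):
        chapter = source.stem
        if not (entities_dir / f"{chapter}-entities.md").exists():
            pending.append(chapter)
    return pending
```

Called with `artifacts/sources/` and `output/entities/`, it would return the chapters that a `--book` or `--all` run still has to process.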
**2. Process Books II-V:**
```bash
python process_chapters.py --book 2 --provider openrouter --no-commit
python process_chapters.py --book 3 --provider openrouter --no-commit
python process_chapters.py --book 4 --provider openrouter --no-commit
python process_chapters.py --book 5 --provider openrouter --no-commit
```
Or all at once:
```bash
python process_chapters.py --all --provider openrouter --no-commit
```
**3. Run metrics after each book (or at the end):**
```bash
python process_chapters.py --metrics --provider openrouter --no-commit
```
**4. Commit the results:**
```bash
git add examples/infospace-with-history/output/
git commit -m "infospace: process all remaining chapters"
```
**5. Review the metrics report:**
Open `output/metrics/metrics-report.md`. It will show:
- Which VSM concepts (S1-S5, recursion, variety, etc.) now have mappings
- Total entity and mapping counts
- Consistency scores
- Recommendations for gaps
### Expected progression
| After | Chapters | Expected coverage |
|-------|----------|-------------------|
| Book I (11 ch.) | 11/35 | S1, S2, S4 strong; S3 emerging |
| Books I-II (16 ch.) | 16/35 | S3 (capital control) covered |
| Books I-III (20 ch.) | 20/35 | Historical patterns add depth |
| Books I-IV (30 ch.) | 30/35 | S5 (policy, mercantilism) emerging |
| All (35 ch.) | 35/35 | Full coverage, S3* and algedonic signals likely from Book V |

Book V (public revenue, taxation, sovereign duties) is expected to
fill the remaining gaps in S3*, S5, and regulatory concepts.

---
## 11. Quality Improvement Loop
The infospace is designed to be **iteratively refined**:
1. **Process chapters**: run the pipeline
2. **Assess metrics**: identify gaps in VSM coverage and consistency
3. **Refine guidelines**: update `extraction-rules.md` or
   `mapping-rules.md` to address identified weaknesses
4. **Re-process**: delete output files for specific chapters and re-run
   with improved guidelines
5. **Compare**: use `git diff` to see how the refined guidelines changed
   the output

Example: if metrics show that S3* (Audit) is consistently missed, you
could add a paragraph to `extraction-rules.md` explicitly asking the LLM
to look for audit, inspection, and oversight mechanisms.

To re-process a specific chapter, remove its chapter view and downstream
outputs. Note that canonical entity files in `output/entities/` are shared
across chapters; only delete individual entity files if you want them
re-extracted from scratch.
```bash
rm -f examples/infospace-with-history/output/entities/book-1-chapter-03-entities.md
rm -f examples/infospace-with-history/output/mappings/book-1-chapter-03-mappings.md
rm -f examples/infospace-with-history/output/analyses/book-1-chapter-03-analysis.md
python process_chapters.py --chapter book-1-chapter-03 --provider openrouter --no-commit
```
To also re-extract specific entities, delete their canonical files first:
```bash
rm -f examples/infospace-with-history/output/entities/extent-of-the-market.md
# then re-process the chapter as above
```
---
## 12. Infrastructure Issues Found and Fixed
During development we documented three issues with the MarkiTect
infrastructure in `INFRA-TASKS.md`:
1. **Artifact repo doesn't store content**: the resolver returned
   placeholder text instead of actual artifact content.
2. **ContentMacro `raw_text` defaults to `""`**: this caused silent data
   corruption when macros were constructed programmatically.
3. **No `@{target}` syntax in MacroParser**: macros had to be
   constructed manually rather than auto-detected from template text.

All three have been fixed in the MarkiTect infrastructure. The pipeline
script (`process_chapters.py`) has been refactored to use the fixed
infrastructure directly: the local content cache, manual macro
construction, and manual substitution workarounds have been removed.
See `INFRA-TASKS.md` for details on each fix.

---
## 13. Adapting This Pattern to Your Own Project
To build your own infospace using this pattern:
1. **Choose your source corpus** — any collection of documents you want
to transform into structured knowledge.
2. **Define your target ontology** — what concepts, relationships, or
categories you want to extract (the VSM used here is just one example).
3. **Write schemas** — markdown documents defining the required sections
and validation rules for each output type.
4. **Write extraction guidelines** — rules that tell the LLM what to
look for and how to handle edge cases.
5. **Create prompt templates** — use `@{macro}` syntax to inject source
text and guidelines at compile time.
6. **Build your pipeline** — follow `process_chapters.py` as a reference
for loading artifacts, resolving templates, and calling the LLM.
7. **Process incrementally** — work through your corpus one document at a
time, tracking everything in git.
8. **Measure and refine** — define metrics, assess them periodically,
and update your guidelines when gaps appear.
The key architectural insight is that **schemas and guidelines are
artifacts** — they live in the same repository as your source text and
can be versioned, diffed, and refined just like code.