# Building an Infospace with History — Tutorial

This tutorial walks through how we built a structured **information space**
(infospace) from Adam Smith's *The Wealth of Nations*, mapping classical
economic concepts to Stafford Beer's **Viable System Model** (VSM), using
MarkiTect's prompt dependency resolution and LLM integration.

By the end you will understand how to:

1. Design schemas that scaffold structured LLM output
2. Write prompt templates with dependency injection (`@{macro}` syntax)
3. Populate source artifacts and reference material
4. Run an incremental, chapter-by-chapter pipeline
5. Track every change through git history
6. Measure completeness and consistency with metrics
7. Continue the work to process remaining chapters

---

## 1. The Idea

We want to transform a large body of text — the full public-domain text of
*The Wealth of Nations* (5 books, 35 chapters) — into a **curated
collection of economic concepts and entities**, each mapped to the VSM.

The challenge: this is too much for a single prompt. The text is hundreds of
thousands of words. We need to work **incrementally**, one chapter at a time,
building up the infospace and tracking progress.

MarkiTect's prompt dependency resolution lets us define **templates** with
`@{placeholder}` macros that are filled from an artifact repository at
execution time. The pipeline compiles each template into a complete prompt,
sends it to an LLM, and stores the output — all tracked by git.

---

## 2. Project Layout

```
examples/infospace-with-history/
│
├── README.md                   # Project brief
├── TUTORIAL.md                 # This file
├── INFRA-TASKS.md              # Infrastructure issues found during the experiment
├── process_chapters.py         # Pipeline script
│
├── schemas/                    # Output structure definitions
│   ├── economic-entity-schema-v1.0.md
│   ├── vsm-concept-schema-v1.0.md
│   ├── vsm-mapping-schema-v1.0.md
│   └── chapter-analysis-schema-v1.0.md
│
├── templates/                  # Prompt templates (with @{macro} placeholders)
│   ├── extract-entities.md
│   ├── map-to-vsm.md
│   ├── synthesize-analysis.md
│   └── assess-metrics.md
│
├── artifacts/                  # Input artifacts
│   ├── sources/                # Chapter text (35 files)
│   ├── guidelines/             # Extraction and mapping rules
│   └── vsm-reference/          # VSM framework definition
│
└── output/                     # Generated artifacts (LLM outputs)
    ├── entities/               # Flat canonical entity set + chapter views
    │   ├── division-of-labour.md        # Canonical entity file (PRIMARY)
    │   ├── exchange.md
    │   ├── commercial-society.md
    │   ├── ...
    │   ├── book-1-chapter-01-entities.md  # Chapter view (transclusion)
    │   ├── book-1-chapter-01-prompt.md   # Compiled prompt
    │   ├── book-1-chapter-04-entities.md  # Also references division-of-labour.md
    │   └── ...
    ├── mappings/               # Per-chapter VSM mappings
    ├── analyses/               # Per-chapter synthesised analyses
    └── metrics/                # Cross-chapter metrics reports
```

**Entity organisation**: The infospace maintains a **flat canonical set**
of entities: one markdown file per entity, stored directly in
`output/entities/`. When a chapter mentions an entity that already exists
(detected by slug collision), the duplicate is skipped and the original
definition is kept. The goal is a **minimal, non-redundant set** of
entities covering the entire book.

Per-chapter `*-entities.md` files are **secondary views** that use
MarkiTect's transclusion engine (`{{ include "entity.md" }}`) to compose
entity content by reference. The same entity (e.g., `division-of-labour.md`)
can appear in multiple chapter views. Editing a canonical entity file
automatically updates every chapter view that references it.
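
As a rough illustration of how a chapter view could be composed, the sketch below renders a view from a list of entity slugs. The function name `render_chapter_view` and the heading line are assumptions made for this example; only the `{{ include "..." }}` directive form comes from the description above.

```python
def render_chapter_view(chapter_id: str, slugs: list[str]) -> str:
    """Build a chapter view that references canonical entity files.

    Hypothetical helper: the real pipeline's view layout may differ.
    """
    lines = [f"# Entities: {chapter_id}", ""]
    for slug in slugs:
        # One transclusion directive per canonical entity file.
        lines.append(f'{{{{ include "{slug}.md" }}}}')
        lines.append("")
    return "\n".join(lines)


view = render_chapter_view("book-1-chapter-01", ["division-of-labour", "exchange"])
print(view)
```

Because the view holds references rather than copies, two chapters that list the same slug end up pointing at the same file.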

**Deduplication**: The pipeline tells the LLM which entities already exist
(via the `@{existing_entities}` macro in the extraction template) so it
focuses on genuinely new entities. At the file level, slug collisions
are detected and skipped as a safety net.
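
The file-level safety net can be pictured with a minimal sketch like the one below. The slug rule (lower-case, hyphen-joined alphanumeric runs) and both function names are assumptions for illustration, not necessarily what `process_chapters.py` does.

```python
import re
import tempfile
from pathlib import Path


def slugify(name: str) -> str:
    """Lower-case the name and join alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", name.lower()))


def write_if_new(entities_dir: Path, name: str, body: str) -> bool:
    """Write a canonical entity file unless its slug already exists."""
    target = entities_dir / f"{slugify(name)}.md"
    if target.exists():
        return False  # slug collision: keep the original definition
    target.write_text(body, encoding="utf-8")
    return True


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    first = write_if_new(out, "Division of Labour", "# Division of Labour\n")
    second = write_if_new(out, "Division of labour", "# Duplicate\n")
    kept = (out / "division-of-labour.md").read_text(encoding="utf-8")
    print(first, second)
```

The second call is skipped because both names reduce to the same slug, so the first definition stays on disk.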

---

## 3. Designing Schemas

Before writing any prompts we defined **four schemas** that tell the LLM
exactly what sections each output document must contain. This ensures every
generated document is machine-parseable and comparable across chapters.

### Economic Entity Schema (`schemas/economic-entity-schema-v1.0.md`)

Every extracted entity must have:

- **H1 heading** with the entity name
- **Definition** (20-150 words)
- **Source Chapter** citing Book and Chapter
- **Context** — where in Smith's argument the entity appears
- **Economic Domain** (Production, Distribution, Exchange, etc.)

Optional: Smith's Original Wording, Modern Interpretation.
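
Because the schema is plain markdown, conformance can also be checked mechanically. The sketch below is one possible checker, assuming each required section is written as an H2 heading (`## Definition`, and so on); that heading level, and the names `parse_sections` and `validate_entity`, are assumptions rather than part of the schema file.

```python
import re

REQUIRED_SECTIONS = ["Definition", "Source Chapter", "Context", "Economic Domain"]


def parse_sections(markdown: str) -> dict[str, str]:
    """Map each H2 heading to the text beneath it."""
    sections: dict[str, str] = {}
    current = None
    for line in markdown.splitlines():
        match = re.match(r"^##\s+(.*)$", line)
        if match:
            current = match.group(1).strip()
            sections[current] = ""
        elif current is not None:
            sections[current] += line + "\n"
    return sections


def validate_entity(markdown: str) -> list[str]:
    """Return a list of schema violations (empty means valid)."""
    problems = []
    if not re.search(r"^# \S", markdown, flags=re.MULTILINE):
        problems.append("missing H1 heading")
    sections = parse_sections(markdown)
    for name in REQUIRED_SECTIONS:
        if not sections.get(name, "").strip():
            problems.append(f"missing section: {name}")
    words = len(sections.get("Definition", "").split())
    if "Definition" in sections and not 20 <= words <= 150:
        problems.append(f"definition has {words} words, expected 20-150")
    return problems


sample = "# Exchange\n\n## Definition\n\nToo short.\n\n## Context\n\nBook I.\n"
problems = validate_entity(sample)
print(problems)
```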

### VSM Mapping Schema (`schemas/vsm-mapping-schema-v1.0.md`)

Every entity-to-VSM mapping must have:

- **H1 heading** in the format `Entity Name -> VSM Concept Name`
- **Economic Entity Reference** and **VSM Concept Reference**
- **Mapping Rationale** (minimum 30 words, grounded in Beer's definitions)
- **Mapping Strength**: Strong, Moderate, or Weak
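
A small sketch of how these rules might be checked follows. The helper names are hypothetical, and the concept name in the usage line is only an example string, not taken from the VSM reference file.

```python
import re

VALID_STRENGTHS = {"Strong", "Moderate", "Weak"}
H1_PATTERN = re.compile(r"^# (?P<entity>.+?) -> (?P<concept>.+)$")


def parse_mapping_heading(line: str) -> tuple[str, str]:
    """Split an `Entity Name -> VSM Concept Name` H1 into its two parts."""
    match = H1_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"not a mapping heading: {line!r}")
    return match.group("entity"), match.group("concept")


def rationale_long_enough(text: str, minimum: int = 30) -> bool:
    """Check the schema's minimum word count for the rationale."""
    return len(text.split()) >= minimum


entity, concept = parse_mapping_heading("# Division of Labour -> System 1 (Operations)")
print(entity, "|", concept)
print("Strong" in VALID_STRENGTHS, rationale_long_enough("too short"))
```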

### Chapter Analysis Schema (`schemas/chapter-analysis-schema-v1.0.md`)

The per-chapter synthesis includes:

- **Chapter Summary** (50-300 words)
- **Entities Extracted** — bulleted list
- **VSM Mappings** — entity, concept, strength
- **VSM Coverage** — explicit assessment of S1 through S5 and S3*
- **Gaps & Observations**

### Metrics Schema (implicit in the `assess-metrics` template)

The metrics report computes:

- VSM Concept Coverage (% of S1-S5, recursion, variety, etc.)
- Chapter Coverage (% of 35 chapters processed)
- Entity and Mapping counts
- Terminology Consistency and Cross-reference Integrity scores
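
The two coverage figures are simple ratios. In the pipeline the LLM produces the report, so the sketch below is only a way to sanity-check such numbers by hand; `coverage_percent` is a hypothetical helper, and the VSM input set is a toy value, not a measured result.

```python
def coverage_percent(covered: set, universe: set) -> float:
    """Share of `universe` that appears in `covered`, as a percentage."""
    if not universe:
        return 0.0
    return round(100 * len(covered & universe) / len(universe), 1)


# Chapter coverage: 4 of 35 chapters processed (integers stand in for chapter IDs).
all_chapters = set(range(1, 36))
processed = {1, 2, 3, 4}
chapter_coverage = coverage_percent(processed, all_chapters)

# Concept coverage over the six systems, using a made-up covered set.
systems = {"S1", "S2", "S3", "S3*", "S4", "S5"}
toy_covered = {"S1", "S2", "S4"}
concept_coverage = coverage_percent(toy_covered, systems)

print(chapter_coverage, concept_coverage)
```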

**Key insight**: Schemas are not code — they are markdown documents that
the LLM reads as instructions. This means you can iterate on them without
changing any infrastructure.

---

## 4. Writing Prompt Templates

Each template is a markdown file containing instructions for the LLM plus
`@{macro_name}` placeholders that MarkiTect's resolver fills with artifact
content at compile time.

### Template 1: Extract Entities (`templates/extract-entities.md`)

```markdown
# Extract Economic Entities

You are an analytical economist specialising in classical economic theory.
Your task is to extract distinct economic entities from a chapter of
Adam Smith's *The Wealth of Nations*.

## Source Chapter

@{chapter_text}

## Extraction Guidelines

@{extraction_rules}

## VSM Framework Context

@{vsm_framework}

## Existing Entities

@{existing_entities}

## Instructions
[... detailed step-by-step instructions ...]

## Output Format

Output each entity as a separate markdown document, delimited by
`--- ENTITY: <entity-name> ---` markers.
```

The first three macros (`chapter_text`, `extraction_rules`,
`vsm_framework`) are resolved by looking up artifacts by name in the
relevant information spaces. The fourth, `existing_entities`, is
generated at runtime from the canonical entity files already on disk,
which enables incremental extraction without duplication.
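
To make the substitution step concrete, here is a minimal stand-in for macro resolution. It is not MarkiTect's resolver: the regular expression, the function names, and the plain dictionary used as an artifact store are all assumptions for illustration.

```python
import re

MACRO_PATTERN = re.compile(r"@\{([A-Za-z_][A-Za-z0-9_]*)\}")


def find_macros(template: str) -> list[str]:
    """Return macro names in order of first appearance."""
    seen: list[str] = []
    for name in MACRO_PATTERN.findall(template):
        if name not in seen:
            seen.append(name)
    return seen


def compile_prompt(template: str, artifacts: dict[str, str]) -> str:
    """Replace every @{name} with its artifact text; fail on unknown names."""
    missing = [n for n in find_macros(template) if n not in artifacts]
    if missing:
        raise KeyError(f"unresolved macros: {missing}")
    # A function replacement avoids backslash handling in the artifact text.
    return MACRO_PATTERN.sub(lambda m: artifacts[m.group(1)], template)


template = "## Source Chapter\n\n@{chapter_text}\n\n## Extraction Guidelines\n\n@{extraction_rules}\n"
artifacts = {
    "chapter_text": "(chapter text goes here)",
    "extraction_rules": "(extraction rules go here)",
}
prompt = compile_prompt(template, artifacts)
print(prompt)
```

Failing loudly on an unresolved macro is a deliberate choice in this sketch: a prompt that still contains `@{...}` would otherwise be sent to the LLM unnoticed.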

### Template 2: Map to VSM (`templates/map-to-vsm.md`)

Takes `@{entities}` (output from stage 1), `@{vsm_framework}`, and
`@{mapping_rules}` as inputs.

### Template 3: Synthesise Analysis (`templates/synthesize-analysis.md`)

Takes `@{chapter_text}`, `@{entities}` (stage 1 output),
`@{mappings}` (stage 2 output), and `@{vsm_framework}`.

### Template 4: Assess Metrics (`templates/assess-metrics.md`)

Takes `@{all_analyses}` (concatenation of all chapter analyses) and
`@{vsm_framework}`. Runs across the entire infospace, not per-chapter.

**Dependency chain per chapter:**

```
chapter_text ──────┐
extraction_rules ──┤
vsm_framework ─────┤
existing_entities ─┤
                   ▼
           extract-entities
                   │
                   ▼ entities
           map-to-vsm
                   │
                   ▼ mappings
           synthesize-analysis
                   │
                   ▼ analysis
```

After all chapters are processed, `assess-metrics` evaluates the
complete infospace.
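
The per-chapter ordering can be derived from the inputs each template declares. The following sketch uses Python's standard `graphlib` module; the dictionary simply restates the macro lists from the four template descriptions and is not read from the pipeline's own dependency graph.

```python
from graphlib import TopologicalSorter

# Each generated artifact maps to the set of inputs it depends on.
dependencies = {
    "entities": {"chapter_text", "extraction_rules", "vsm_framework", "existing_entities"},
    "mappings": {"entities", "vsm_framework", "mapping_rules"},
    "analysis": {"chapter_text", "entities", "mappings", "vsm_framework"},
}

# static_order() yields every node after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())

# Keep only the generated artifacts to see the stage order.
stages = [name for name in order if name in dependencies]
print(stages)  # ['entities', 'mappings', 'analysis']
```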

---

## 5. Populating Artifacts

### Source chapters (`artifacts/sources/`)

35 markdown files containing the full public-domain text of each chapter.
Named by convention: `book-1-chapter-01.md` through `book-5-chapter-03.md`,
plus `introduction.md`.
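
The naming convention makes it easy to select chapters by book. The sketch below shows one way to parse the filenames; the helper names are hypothetical and the file list is a small sample rather than the full directory.

```python
import re
from typing import Optional

CHAPTER_PATTERN = re.compile(r"^book-(\d+)-chapter-(\d+)\.md$")


def parse_chapter_filename(filename: str) -> Optional[tuple[int, int]]:
    """Return (book, chapter) for a chapter file, or None for other files."""
    match = CHAPTER_PATTERN.match(filename)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def chapters_in_book(filenames: list[str], book: int) -> list[str]:
    """Filter and sort the chapter files that belong to one book."""
    selected = []
    for name in filenames:
        parsed = parse_chapter_filename(name)
        if parsed is not None and parsed[0] == book:
            selected.append((parsed[1], name))
    return [name for _, name in sorted(selected)]


sample_files = [
    "book-5-chapter-03.md",
    "introduction.md",
    "book-1-chapter-02.md",
    "book-1-chapter-01.md",
]
book_one = chapters_in_book(sample_files, 1)
print(book_one)
```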

These are loaded into the `infospace-sources` information space.

### Guidelines (`artifacts/guidelines/`)

Two hand-written reference documents:

- **`extraction-rules.md`** — What constitutes an entity, granularity rules,
  naming conventions, quality checks.
- **`mapping-rules.md`** — How to map entities to VSM systems, what
  constitutes strong/moderate/weak mapping strength.

These are loaded into `infospace-guidelines`.

### VSM reference (`artifacts/vsm-reference/`)

- **`vsm-framework.md`** — Complete description of Beer's VSM (S1-S5, S3*,
  recursion, variety, viability, attenuation/amplification, algedonic
  signals, autonomy). Includes economic interpretations for each system.

Loaded into `infospace-vsm-reference`.

---

## 6. The Pipeline Script

`process_chapters.py` orchestrates everything. It:

1. Initialises the artifact repository (SQLite) and information spaces
2. Loads all static artifacts (templates, guidelines, VSM reference)
3. For each chapter, runs the three-stage pipeline
4. Optionally calls an LLM to auto-generate outputs
5. Records dependency edges in the graph
6. Commits results to git

### Running a single chapter

```bash
# Manual mode (writes prompts, waits for you to provide output files):
python process_chapters.py --chapter book-1-chapter-05 --no-commit

# Automatic mode via OpenRouter (recommended — fast, real token counts):
python process_chapters.py --chapter book-1-chapter-05 --provider openrouter --no-commit

# Automatic mode via Claude Code CLI:
python process_chapters.py --chapter book-1-chapter-05 --provider claude-code --no-commit

# With a specific model:
python process_chapters.py --chapter book-1-chapter-05 --provider openrouter --model anthropic/claude-haiku-4-5-20251001 --no-commit
```

### Running a whole book

```bash
python process_chapters.py --book 1 --provider openrouter --no-commit
```

### Running all chapters

```bash
python process_chapters.py --all --provider openrouter --no-commit
```

### Checking progress

```bash
python process_chapters.py --list
```

Prints a table showing which chapters have completed each stage
(entity counts reflect the chapter view's transclusion references,
including shared entities from earlier chapters):

```
Available chapters (35):

  Chapter                        Entities     Mappings     Analysis
  ------------------------------ ------------ ------------ ------------
  book-1-chapter-01              done (13)    done         done
  book-1-chapter-02              done (7)     done         done
  book-1-chapter-03              done (18)    done         done
  book-1-chapter-04              done (5)     done         done
  book-1-chapter-05              -            -            -
  ...

  Canonical entity set: 41 unique entities
```
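
The four per-chapter counts above sum to 43 against 41 unique entities, which is what one would expect when some entities are referenced by more than one chapter view. A minimal sketch of such a count follows; the regular expression is an assumption about how the transclusion directive is written, and the two views are toy inputs.

```python
import re

INCLUDE_PATTERN = re.compile(r'\{\{\s*include\s+"([^"]+)"\s*\}\}')


def referenced_entities(view_text: str) -> list[str]:
    """List the files a chapter view pulls in via transclusion."""
    return INCLUDE_PATTERN.findall(view_text)


views = {
    "book-1-chapter-01": '{{ include "division-of-labour.md" }}\n{{ include "exchange.md" }}\n',
    "book-1-chapter-04": '{{ include "division-of-labour.md" }}\n',
}

# Per-chapter counts include shared entities; the unique set does not double-count.
per_chapter = {chapter: len(referenced_entities(text)) for chapter, text in views.items()}
unique = {name for text in views.values() for name in referenced_entities(text)}
print(per_chapter, len(unique))
```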

### Assessing metrics

After processing a batch of chapters, run the metrics assessment:

```bash
python process_chapters.py --metrics --provider openrouter --no-commit
```

This concatenates all completed analyses and asks the LLM to evaluate
coverage, consistency, and completeness.

### Dependency statistics

```bash
python process_chapters.py --stats
```

---

## 7. How the LLM Integration Works

The pipeline uses MarkiTect's `markitect.llm` module, which provides two
adapter backends that implement the `LLMAdapter` interface:

| Backend | How it works | Pros | Cons |
|---------|-------------|------|------|
| `openrouter` | HTTP POST to OpenRouter API | Fast, real token counts, any model | Needs API key |
| `claude-code` | Shells out to `claude --print` | No API key needed if CLI installed | Slower, estimated token counts |

### API key setup (OpenRouter)

The key is looked up in the following order:

1. The `--api-key` command-line flag (planned; not yet implemented in the CLI)
2. The `OPENROUTER_API_KEY` environment variable
3. A file named `apikey-openrouter.txt` in the project root (git-ignored)
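
The lookup order for the two sources that work today could be sketched as follows. `find_openrouter_key` is a hypothetical function, not the one in `markitect.llm`, and the key strings are placeholders.

```python
import os
import tempfile
from pathlib import Path
from typing import Mapping, Optional


def find_openrouter_key(project_root: Path,
                        env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Check the environment variable first, then the key file."""
    env = os.environ if env is None else env
    key = env.get("OPENROUTER_API_KEY", "").strip()
    if key:
        return key
    key_file = project_root / "apikey-openrouter.txt"
    if key_file.is_file():
        # Strip the trailing newline most editors add.
        return key_file.read_text(encoding="utf-8").strip() or None
    return None


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "apikey-openrouter.txt").write_text("file-key\n", encoding="utf-8")
    from_env = find_openrouter_key(root, env={"OPENROUTER_API_KEY": "env-key"})
    from_file = find_openrouter_key(root, env={})
    print(from_env, from_file)
```

Passing `env` explicitly keeps the function testable without touching the real environment.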

### What happens per stage

1. The pipeline **resolves** macro placeholders by looking up artifacts
   in the repository
2. It **compiles** the template into a complete prompt (macros replaced
   with real content)
3. It writes the compiled prompt to `output/<stage>/<chapter>-prompt.md`
   for inspection
4. If no output exists yet and an LLM adapter is configured, it
   **executes** the prompt
5. **For entity extraction (stage 1):** the pipeline first binds the
   list of already-existing entity slugs to `@{existing_entities}` so
   the LLM knows what to skip. The LLM returns combined content with
   `--- ENTITY: <name> ---` delimiters. The pipeline splits this into
   the **flat canonical directory** (`output/entities/<slug>.md`),
   skipping any entity whose slug already exists. It then generates the
   chapter view file with transclusion directives. The combined content
   is never persisted as a single file — canonical entity files are the
   source of truth.
6. **For other stages:** the result is written directly to its output file
7. The output is **stored** as a generated artifact in the repository
8. Dependency edges are **recorded** in the graph
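
The splitting in step 5 can be sketched with a single regular expression. This is an illustration under the assumption that each marker sits on its own line; `split_entities` is a hypothetical name and the sample text is invented for the example.

```python
import re

ENTITY_MARKER = re.compile(r"^--- ENTITY: (.+?) ---[ \t]*$", flags=re.MULTILINE)


def split_entities(combined: str) -> dict[str, str]:
    """Map each entity name to the markdown that follows its marker."""
    # With one capture group, split() returns
    # [preamble, name1, body1, name2, body2, ...].
    parts = ENTITY_MARKER.split(combined)
    names = parts[1::2]
    bodies = parts[2::2]
    return {name.strip(): body.strip() + "\n" for name, body in zip(names, bodies)}


combined = (
    "--- ENTITY: Exchange ---\n"
    "# Exchange\n\n## Definition\n\nTrading one good for another.\n"
    "--- ENTITY: Barter ---\n"
    "# Barter\n"
)
entities = split_entities(combined)
print(list(entities))
```

Each resulting body could then be written to `output/entities/<slug>.md`, skipping slugs that already exist.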

---

## 8. Tracking History with Git

Every processed chapter produces a git commit containing:

- Compiled prompts (`*-prompt.md`) — so you can audit exactly what was sent
- Canonical entity files (`output/entities/<slug>.md`) — one file per entity,
  shared across chapters, first occurrence wins
- Chapter entity views (`<chapter>-entities.md`) — transclusion into the
  canonical entities relevant to each chapter
- Generated outputs (`*-mappings.md`, `*-analysis.md`)

This means:

- `git log` shows the chronological order of processing
- `git diff` between commits shows what each chapter contributed
- You can `git bisect` to find where quality degraded
- You can revert a chapter and re-process it with different settings

To let the script auto-commit (default):

```bash
python process_chapters.py --chapter book-1-chapter-05 --provider openrouter
```

To commit manually after reviewing:

```bash
python process_chapters.py --chapter book-1-chapter-05 --provider openrouter --no-commit
# review new entity files in output/entities/ (look for recently modified .md files)
# review chapter view in output/entities/book-1-chapter-05-entities.md
git add examples/infospace-with-history/output/
git commit -m "infospace: process book-1-chapter-05"
```

---

## 9. Cost and Performance

Measured while processing Book I, Chapters 3 and 4:

| | Claude Code CLI | OpenRouter |
|---|---|---|
| Time per chapter | ~5 minutes | ~2 minutes |
| Token counts | Estimated (4 chars/tok) | Real (from API) |
| Cost per chapter | ~$0.35 est. | ~$0.07 est. |

**Projected cost for all 35 chapters via OpenRouter:** ~$2.50
(varies by chapter length; Book V chapters are longer).
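
The projection is a straight multiplication of the per-chapter estimate, and the Claude Code token counts follow the 4-characters-per-token rule from the table. A short sketch of both calculations, with hypothetical helper names:

```python
def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Rough token estimate: character count divided by 4, rounded up."""
    return -(-len(text) // chars_per_token)  # ceiling division


def projected_cost(chapters: int, cost_per_chapter: float) -> float:
    """Multiply the per-chapter estimate by the chapter count."""
    return round(chapters * cost_per_chapter, 2)


openrouter_total = projected_cost(35, 0.07)   # about 2.45, quoted above as ~$2.50
claude_code_total = projected_cost(35, 0.35)  # same arithmetic, not a measurement
print(openrouter_total, claude_code_total, estimate_tokens("abcde"))
```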

To reduce costs further, use a cheaper model:

```bash
python process_chapters.py --all --provider openrouter --model anthropic/claude-haiku-4-5-20251001 --no-commit
```

---

## 10. Completing the Remaining Chapters

As of now, 4 of 35 chapters are processed (Book I, Chapters 1-4). Here is
how to complete the rest.

### Step-by-step

**1. Process remaining Book I chapters (5-11):**

```bash
python process_chapters.py --book 1 --provider openrouter --no-commit
```

Already-processed chapters are skipped (their chapter view files exist).
Entities from earlier chapters are automatically shared — the LLM is
told which entities already exist and avoids re-extracting them.

**2. Process Books II-V:**

```bash
python process_chapters.py --book 2 --provider openrouter --no-commit
python process_chapters.py --book 3 --provider openrouter --no-commit
python process_chapters.py --book 4 --provider openrouter --no-commit
python process_chapters.py --book 5 --provider openrouter --no-commit
```

Or all at once:

```bash
python process_chapters.py --all --provider openrouter --no-commit
```

**3. Run metrics after each book (or at the end):**

```bash
python process_chapters.py --metrics --provider openrouter --no-commit
```

**4. Commit the results:**

```bash
git add examples/infospace-with-history/output/
git commit -m "infospace: process all remaining chapters"
```

**5. Review the metrics report:**

Open `output/metrics/metrics-report.md`. It will show:

- Which VSM concepts (S1-S5, recursion, variety, etc.) now have mappings
- Total entity and mapping counts
- Consistency scores
- Recommendations for gaps

### Expected progression

| After | Chapters | Expected coverage |
|-------|----------|-------------------|
| Book I (11 ch.) | 11/35 | S1, S2, S4 strong; S3 emerging |
| Books I-II (16 ch.) | 16/35 | S3 (capital control) covered |
| Books I-III (20 ch.) | 20/35 | Historical patterns add depth |
| Books I-IV (30 ch.) | 30/35 | S5 (policy, mercantilism) emerging |
| All (35 ch.) | 35/35 | Full coverage, S3* and algedonic signals likely from Book V |

Book V (public revenue, taxation, sovereign duties) is expected to
fill the remaining gaps in S3*, S5, and regulatory concepts.

---

## 11. Quality Improvement Loop

The infospace is designed to be **iteratively refined**:

1. **Process chapters** — run the pipeline
2. **Assess metrics** — identify gaps in VSM coverage and consistency
3. **Refine guidelines** — update `extraction-rules.md` or
   `mapping-rules.md` to address identified weaknesses
4. **Re-process** — delete output files for specific chapters and re-run
   with improved guidelines
5. **Compare** — use git diff to see how the refined guidelines changed
   the output

Example: if metrics show that S3* (Audit) is consistently missed, you
could add a paragraph to `extraction-rules.md` explicitly asking the LLM
to look for audit, inspection, and oversight mechanisms.

To re-process a specific chapter, remove its chapter view and downstream
outputs. Note: canonical entity files in `output/entities/` are shared
across chapters — only delete individual entity files if you want them
re-extracted from scratch.

```bash
rm -f examples/infospace-with-history/output/entities/book-1-chapter-03-entities.md
rm -f examples/infospace-with-history/output/mappings/book-1-chapter-03-mappings.md
rm -f examples/infospace-with-history/output/analyses/book-1-chapter-03-analysis.md
python process_chapters.py --chapter book-1-chapter-03 --provider openrouter --no-commit
```
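
The same clean-up can be scripted. The sketch below mirrors the three `rm -f` commands above; the helper names are hypothetical, and the demo runs in a temporary directory so nothing in the real `output/` tree is touched.

```python
import tempfile
from pathlib import Path


def chapter_outputs(output_root: Path, chapter_id: str) -> list[Path]:
    """Paths of the chapter view and its downstream outputs."""
    return [
        output_root / "entities" / f"{chapter_id}-entities.md",
        output_root / "mappings" / f"{chapter_id}-mappings.md",
        output_root / "analyses" / f"{chapter_id}-analysis.md",
    ]


def remove_chapter_outputs(output_root: Path, chapter_id: str) -> list[Path]:
    """Delete the chapter's outputs, leaving canonical entity files alone."""
    removed = []
    for path in chapter_outputs(output_root, chapter_id):
        if path.exists():
            path.unlink()
            removed.append(path)
    return removed


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for path in chapter_outputs(root, "book-1-chapter-03"):
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("placeholder\n", encoding="utf-8")
    canonical = root / "entities" / "extent-of-the-market.md"
    canonical.write_text("# Extent of the Market\n", encoding="utf-8")
    removed = remove_chapter_outputs(root, "book-1-chapter-03")
    canonical_survives = canonical.exists()
    print(len(removed), canonical_survives)
```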

To also re-extract specific entities, delete their canonical files first:

```bash
rm -f examples/infospace-with-history/output/entities/extent-of-the-market.md
# then re-process the chapter as above
```

---

## 12. Infrastructure Issues Found and Fixed

During development we documented three issues with the MarkiTect
infrastructure in `INFRA-TASKS.md`:

1. **Artifact repo doesn't store content** — the resolver returned
   placeholder text instead of actual artifact content.
2. **ContentMacro `raw_text` defaults to `""`** — caused silent data
   corruption when macros were constructed programmatically.
3. **No `@{target}` syntax in MacroParser** — macros had to be
   constructed manually rather than auto-detected from template text.

All three have been fixed in the MarkiTect infrastructure. The pipeline
script (`process_chapters.py`) has been refactored to use the fixed
infrastructure directly — the local content cache, manual macro
construction, and manual substitution workarounds have been removed.
See `INFRA-TASKS.md` for details on each fix.

---

## 13. Adapting This Pattern to Your Own Project

To build your own infospace using this pattern:

1. **Choose your source corpus** — any collection of documents you want
   to transform into structured knowledge.
2. **Define your target ontology** — what concepts, relationships, or
   categories you want to extract (the VSM is just one example).
3. **Write schemas** — markdown documents defining the required sections
   and validation rules for each output type.
4. **Write extraction guidelines** — rules that tell the LLM what to
   look for and how to handle edge cases.
5. **Create prompt templates** — use `@{macro}` syntax to inject source
   text and guidelines at compile time.
6. **Build your pipeline** — follow `process_chapters.py` as a reference
   for loading artifacts, resolving templates, and calling the LLM.
7. **Process incrementally** — work through your corpus one document at a
   time, tracking everything in git.
8. **Measure and refine** — define metrics, assess them periodically,
   and update your guidelines when gaps appear.

The key architectural insight is that **schemas and guidelines are
artifacts** — they live in the same repository as your source text and
can be versioned, diffed, and refined just like code.