# Building an Infospace with History — Tutorial

This tutorial walks through how we built a structured **information space**
(infospace) from Adam Smith's *The Wealth of Nations*, mapping classical
economic concepts to Stafford Beer's **Viable System Model** (VSM), using
MarkiTect's prompt dependency resolution and LLM integration.

By the end you will understand how to:

1. Design schemas that scaffold structured LLM output
2. Write prompt templates with dependency injection (`@{macro}` syntax)
3. Populate source artifacts and reference material
4. Run an incremental, chapter-by-chapter pipeline
5. Track every change through git history
6. Measure completeness and consistency with metrics
7. Continue the work to process remaining chapters

---

## 1. The Idea

We want to transform a large body of text — the full public-domain text of
*The Wealth of Nations* (5 books, 35 chapters) — into a **curated
collection of economic concepts and entities**, each mapped to the VSM.

The challenge: this is too much for a single prompt. The text is hundreds of
thousands of words. We need to work **incrementally**, one chapter at a time,
building up the infospace and tracking progress.

MarkiTect's prompt dependency resolution lets us define **templates** with
`@{placeholder}` macros that are filled from an artifact repository at
execution time. The pipeline compiles each template into a complete prompt,
sends it to an LLM, and stores the output — all tracked by git.

---

## 2. Project Layout

```
examples/infospace-with-history/
│
├── README.md                       # Project brief
├── TUTORIAL.md                     # This file
├── INFRA-TASKS.md                  # Infrastructure issues found during the experiment
├── process_chapters.py             # Pipeline script
├── infospace.db                    # SQLite artifact database (generated, not in git)
│
├── schemas/                        # Output structure definitions
│   ├── economic-entity-schema-v1.0.md
│   ├── vsm-concept-schema-v1.0.md
│   ├── vsm-mapping-schema-v1.0.md
│   └── chapter-analysis-schema-v1.0.md
│
├── templates/                      # Prompt templates (with @{macro} placeholders)
│   ├── extract-entities.md
│   ├── map-to-vsm.md
│   ├── synthesize-analysis.md
│   └── assess-metrics.md
│
├── artifacts/                      # Input artifacts
│   ├── sources/                    # Chapter text (35 files)
│   ├── guidelines/                 # Extraction and mapping rules
│   └── vsm-reference/              # VSM framework definition
│
└── output/                         # Generated artifacts (LLM outputs)
    ├── entities/                   # Flat canonical entity set + chapter views
    │   ├── division-of-labour.md           # Canonical entity file (PRIMARY)
    │   ├── exchange.md
    │   ├── commercial-society.md
    │   ├── ...
    │   ├── book-1-chapter-01-entities.md   # Chapter view (transclusion)
    │   ├── book-1-chapter-01-prompt.md     # Compiled prompt
    │   ├── book-1-chapter-04-entities.md   # Also references division-of-labour.md
    │   └── ...
    ├── mappings/                   # Per-chapter VSM mappings
    ├── analyses/                   # Per-chapter synthesised analyses
    └── metrics/                    # Cross-chapter metrics reports
```

**Entity organisation**: The infospace maintains a **flat canonical set**
of entities — one markdown file per entity, stored directly in
`output/entities/`. When a chapter mentions an entity that already exists
(detected by slug collision), the duplicate is skipped and the original
definition is kept. This builds a **minimal necessary and sufficient set**
of entities across the entire book.

Per-chapter `*-entities.md` files are **secondary views** that use
MarkiTect's transclusion engine (`{{ include "entity.md" }}`) to compose
entity content by reference. The same entity (e.g., `division-of-labour.md`)
can appear in multiple chapter views. Editing a canonical entity file
automatically updates every chapter view that references it.

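As a rough illustration of how a chapter view could be assembled, the sketch below builds a file of transclusion directives from a list of entity slugs. The function name `render_chapter_view` and the heading line are assumptions made for this example; only the `{{ include "..." }}` directive form comes from the text above, and the real pipeline may lay the view out differently.

```python
def render_chapter_view(chapter: str, slugs: list[str]) -> str:
    """Return the text of a chapter view that transcludes canonical entity files."""
    lines = [f"# Entities: {chapter}", ""]  # heading format is an assumption
    for slug in slugs:
        # One directive per entity; the content itself stays in <slug>.md.
        lines.append(f'{{{{ include "{slug}.md" }}}}')
        lines.append("")
    return "\n".join(lines)


view = render_chapter_view("book-1-chapter-04", ["division-of-labour", "exchange"])
print(view)
```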
**Deduplication**: The pipeline tells the LLM which entities already exist
(via the `@{existing_entities}` macro in the extraction template) so it
focuses on genuinely new entities. At the file level, slug collisions
are detected and skipped as a safety net.

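The file-level safety net can be pictured with a small sketch. It assumes a slug is the lowercased name with runs of letters and digits joined by hyphens; that rule, and the names `slugify` and `add_entity`, are hypothetical and not taken from `process_chapters.py`. An in-memory dictionary stands in for the `output/entities/` directory.

```python
import re


def slugify(name: str) -> str:
    """Lowercase the name and join alphanumeric runs with hyphens (assumed rule)."""
    return "-".join(re.findall(r"[a-z0-9]+", name.lower()))


def add_entity(canonical: dict[str, str], name: str, body: str) -> bool:
    """Store the entity unless its slug is already taken; return True if stored."""
    slug = slugify(name)
    if slug in canonical:
        return False  # slug collision: keep the original definition
    canonical[slug] = body
    return True


canonical: dict[str, str] = {}
first = add_entity(canonical, "Division of Labour", "Definition from Book I, Chapter 1.")
second = add_entity(canonical, "division of labour", "A later restatement.")
```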
**Entity lifecycle**: Once an entity enters the canonical set, it is
**never silently deleted**. Entities may only be retired when they have been
subsumed by another entity, found to partially map onto other entities, or
otherwise determined to be redundant. Retired entities are **archived** —
moved to `output/entities/archive/` with a dated header documenting the
reason. This preserves the intellectual history of the infospace: every
decision to drop an entity is a deliberate, documented learning.

```bash
# Archive an entity that has been subsumed by another
python process_chapters.py --archive-entity enlarged-monopoly \
    --reason "Subsumed by monopoly-price — both describe the same market distortion"

# The archived file retains its full content with an explanatory header
cat output/entities/archive/enlarged-monopoly.md
```

The `--list` command shows both the active canonical set and the archive count.

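For readers who want to see what archiving amounts to, here is a minimal sketch of the move-with-header step. The header format (a blockquote with an ISO date) and the function name `archive_entity` are assumptions; the actual `--archive-entity` implementation may write a different header.

```python
import datetime
import pathlib
import tempfile


def archive_entity(entities_dir: pathlib.Path, slug: str, reason: str) -> pathlib.Path:
    """Move <slug>.md into archive/ and prepend a dated note giving the reason."""
    source = entities_dir / f"{slug}.md"
    archive_dir = entities_dir / "archive"
    archive_dir.mkdir(exist_ok=True)
    # Assumed header format: one blockquote line with date and reason.
    header = f"> Archived {datetime.date.today().isoformat()}: {reason}\n\n"
    target = archive_dir / source.name
    target.write_text(header + source.read_text(encoding="utf-8"), encoding="utf-8")
    source.unlink()  # the entity leaves the active set only after the copy exists
    return target


# Demonstration in a throwaway directory.
demo_dir = pathlib.Path(tempfile.mkdtemp())
(demo_dir / "enlarged-monopoly.md").write_text("# Enlarged Monopoly\n", encoding="utf-8")
archived = archive_entity(demo_dir, "enlarged-monopoly", "Subsumed by monopoly-price")
```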
---

## 3. Designing Schemas

Before writing any prompts we defined **four schemas** that tell the LLM
exactly what sections each output document must contain. This ensures every
generated document is machine-parseable and comparable across chapters.

### Economic Entity Schema (`schemas/economic-entity-schema-v1.0.md`)

Every extracted entity must have:

- **H1 heading** with the entity name
- **Definition** (20-150 words)
- **Source Chapter** citing Book and Chapter
- **Context** — where in Smith's argument the entity appears
- **Economic Domain** (Production, Distribution, Exchange, etc.)

Optional: Smith's Original Wording, Modern Interpretation.

### VSM Mapping Schema (`schemas/vsm-mapping-schema-v1.0.md`)

Every entity-to-VSM mapping must have:

- **H1 heading** in the format `Entity Name -> VSM Concept Name`
- **Economic Entity Reference** and **VSM Concept Reference**
- **Mapping Rationale** (minimum 30 words, grounded in Beer's definitions)
- **Mapping Strength**: Strong, Moderate, or Weak

### Chapter Analysis Schema (`schemas/chapter-analysis-schema-v1.0.md`)

The per-chapter synthesis includes:

- **Chapter Summary** (50-300 words)
- **Entities Extracted** — bulleted list
- **VSM Mappings** — entity, concept, strength
- **VSM Coverage** — explicit assessment of S1 through S5 and S3*
- **Gaps & Observations**

### Metrics Schema (implicit in `assess-metrics` template)

The metrics report computes:

- VSM Concept Coverage (% of S1-S5, recursion, variety, etc.)
- Chapter Coverage (% of 35 chapters processed)
- Entity and Mapping counts
- Terminology Consistency and Cross-reference Integrity scores

**Key insight**: Schemas are not code — they are markdown documents that
the LLM reads as instructions. This means you can iterate on them without
changing any infrastructure.

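Because the schemas fix section names and word limits, a generated document can also be checked mechanically after the fact. The sketch below is one possible checker for the Economic Entity Schema. It assumes each required section appears as an H2 heading (`## Definition` and so on); the schema files may prescribe a different heading level, and `check_entity` is a hypothetical helper, not part of MarkiTect.

```python
import re

REQUIRED_SECTIONS = ["Definition", "Source Chapter", "Context", "Economic Domain"]


def check_entity(markdown: str) -> list[str]:
    """Return a list of schema problems; an empty list means the document passes."""
    problems = []
    if not re.search(r"^# \S", markdown, flags=re.MULTILINE):
        problems.append("missing H1 heading")
    # Splitting on H2 headings yields [preamble, title1, body1, title2, body2, ...].
    parts = re.split(r"^## +(.+?)\s*$", markdown, flags=re.MULTILINE)
    sections = dict(zip(parts[1::2], parts[2::2]))
    for title in REQUIRED_SECTIONS:
        if title not in sections:
            problems.append(f"missing section: {title}")
    words = len(sections.get("Definition", "").split())
    if "Definition" in sections and not 20 <= words <= 150:
        problems.append(f"definition has {words} words, expected 20-150")
    return problems


sample = (
    "# Division of Labour\n\n"
    "## Definition\n\nToo short.\n\n"
    "## Context\n\nBook I opens with it.\n"
)
issues = check_entity(sample)
print(issues)
```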
---

## 4. Writing Prompt Templates

Each template is a markdown file containing instructions for the LLM plus
`@{macro_name}` placeholders that MarkiTect's resolver fills with artifact
content at compile time.

### Template 1: Extract Entities (`templates/extract-entities.md`)

```markdown
# Extract Economic Entities

You are an analytical economist specialising in classical economic theory.
Your task is to extract distinct economic entities from a chapter of
Adam Smith's *The Wealth of Nations*.

## Source Chapter

@{chapter_text}

## Extraction Guidelines

@{extraction_rules}

## VSM Framework Context

@{vsm_framework}

## Existing Entities

@{existing_entities}

## Instructions
[... detailed step-by-step instructions ...]

## Output Format

Output each entity as a separate markdown document, delimited by
`--- ENTITY: <entity-name> ---` markers.
```

The four macros (`chapter_text`, `extraction_rules`, `vsm_framework`,
`existing_entities`) are resolved by looking up artifacts by name in
the relevant information spaces. The `existing_entities` list is
dynamically generated at runtime from the canonical entity files
already on disk, enabling incremental extraction without duplication.

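To make the compile step concrete, here is a minimal sketch of `@{name}` substitution using a regular expression. It is a stand-in for MarkiTect's resolver, not its actual API: `compile_template` and the plain dictionary of artifacts are assumptions made for illustration.

```python
import re

MACRO = re.compile(r"@\{(\w+)\}")


def compile_template(template: str, artifacts: dict[str, str]) -> str:
    """Replace each @{name} with the matching artifact text; fail on unknown names."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in artifacts:
            raise KeyError(f"no artifact bound to macro @{{{name}}}")
        return artifacts[name]

    # A function replacement avoids backslash-escape surprises in artifact text.
    return MACRO.sub(substitute, template)


prompt = compile_template(
    "## Source Chapter\n\n@{chapter_text}\n\n## Existing Entities\n\n@{existing_entities}\n",
    {"chapter_text": "(chapter text goes here)", "existing_entities": "- exchange"},
)
print(prompt)
```

Raising on an unbound macro, rather than leaving the placeholder in place, is a design choice of this sketch: it keeps a half-compiled prompt from reaching the LLM unnoticed.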
### Template 2: Map to VSM (`templates/map-to-vsm.md`)

Takes `@{entities}` (output from stage 1), `@{vsm_framework}`, and
`@{mapping_rules}` as inputs.

### Template 3: Synthesise Analysis (`templates/synthesize-analysis.md`)

Takes `@{chapter_text}`, `@{entities}` (stage 1 output),
`@{mappings}` (stage 2 output), and `@{vsm_framework}`.

### Template 4: Assess Metrics (`templates/assess-metrics.md`)

Takes `@{all_analyses}` (concatenation of all chapter analyses) and
`@{vsm_framework}`. Runs across the entire infospace, not per-chapter.

**Dependency chain per chapter:**

```
chapter_text ──────┐
extraction_rules ──┤
vsm_framework ─────┤
                   ▼
            extract-entities
                   │
                   ▼ entities
               map-to-vsm
                   │
                   ▼ mappings
          synthesize-analysis
                   │
                   ▼ analysis
```

After all chapters are processed, `assess-metrics` evaluates the
complete infospace.

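The per-chapter chain can also be written down as a dependency graph and ordered automatically. The sketch below uses the standard library's `graphlib` (Python 3.9+); the edges are read off the template inputs listed above, while the dictionary layout is only an illustration and not how MarkiTect stores its graph.

```python
from graphlib import TopologicalSorter

# Each key depends on the names in its set (inputs taken from the templates above).
DEPENDENCIES = {
    "extract-entities": {"chapter_text", "extraction_rules", "vsm_framework", "existing_entities"},
    "map-to-vsm": {"extract-entities", "vsm_framework", "mapping_rules"},
    "synthesize-analysis": {"chapter_text", "extract-entities", "map-to-vsm", "vsm_framework"},
}
STAGES = set(DEPENDENCIES)

# static_order() lists every node after all of its dependencies; keep only the stages.
order = [node for node in TopologicalSorter(DEPENDENCIES).static_order() if node in STAGES]
print(order)
```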
---

## 5. Populating Artifacts

### Source chapters (`artifacts/sources/`)

35 markdown files containing the full public-domain text of each chapter.
Named by convention: `book-1-chapter-01.md` through `book-5-chapter-03.md`,
plus `introduction.md`.

These are loaded into the `infospace-sources` information space.

### Guidelines (`artifacts/guidelines/`)

Two hand-written reference documents:

- **`extraction-rules.md`** — What constitutes an entity, granularity rules,
  naming conventions, quality checks.
- **`mapping-rules.md`** — How to map entities to VSM systems, what
  constitutes strong/moderate/weak mapping strength.

These are loaded into `infospace-guidelines`.

### VSM reference (`artifacts/vsm-reference/`)

- **`vsm-framework.md`** — Complete description of Beer's VSM (S1-S5, S3*,
  recursion, variety, viability, attenuation/amplification, algedonic
  signals, autonomy). Includes economic interpretations for each system.

Loaded into `infospace-vsm-reference`.

---

## 6. The Pipeline Script

`process_chapters.py` orchestrates everything. It:

1. Initialises the artifact repository (SQLite) and information spaces
2. Loads all static artifacts (templates, guidelines, VSM reference)
3. For each chapter, runs the three-stage pipeline
4. Optionally calls an LLM to auto-generate outputs
5. Records dependency edges in the graph
6. Commits results to git

### Running a single chapter

```bash
# Manual mode (writes prompts, waits for you to provide output files):
python process_chapters.py --chapter book-1-chapter-05 --no-commit

# Automatic mode via OpenRouter (recommended — fast, real token counts):
python process_chapters.py --chapter book-1-chapter-05 --provider openrouter --no-commit

# Automatic mode via Gemini free tier:
python process_chapters.py --chapter book-1-chapter-05 --provider gemini --no-commit

# Automatic mode via Claude Code CLI:
python process_chapters.py --chapter book-1-chapter-05 --provider claude-code --no-commit

# With a specific model:
python process_chapters.py --chapter book-1-chapter-05 --provider openrouter --model anthropic/claude-haiku-4-5-20251001 --no-commit
python process_chapters.py --chapter book-1-chapter-05 --provider gemini --model gemini-2.5-flash-lite --no-commit
```

### Running a whole book

```bash
python process_chapters.py --book 1 --provider openrouter --no-commit
```

### Running all chapters

```bash
python process_chapters.py --all --provider openrouter --no-commit
```

### Checking progress

```bash
python process_chapters.py --list
```

Prints a table showing which chapters have completed each stage
(entity counts reflect the chapter view's transclusion references,
including shared entities from earlier chapters):

```
Available chapters (35):

Chapter                        Entities     Mappings     Analysis
------------------------------ ------------ ------------ ------------
book-1-chapter-01              done (13)    done         done
book-1-chapter-02              done (7)     done         done
book-1-chapter-03              done (18)    done         done
book-1-chapter-04              done (5)     done         done
book-1-chapter-05              done (1)     done         done
...

Canonical entity set: 42 unique entities
```

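A table like this can be derived from the files on disk alone. The following sketch shows one way to do it, assuming the output paths used elsewhere in this tutorial (`entities/<chapter>-entities.md`, `mappings/<chapter>-mappings.md`, `analyses/<chapter>-analysis.md`). The `-` marker for an unfinished stage and the function name `chapter_status` are assumptions; the real `--list` output may differ.

```python
import pathlib
import re
import tempfile

STAGE_FILES = {
    "Entities": "entities/{chapter}-entities.md",
    "Mappings": "mappings/{chapter}-mappings.md",
    "Analysis": "analyses/{chapter}-analysis.md",
}
INCLUDE = re.compile(r'\{\{\s*include\s+"[^"]+"\s*\}\}')


def chapter_status(output_dir: pathlib.Path, chapter: str) -> dict[str, str]:
    """Report 'done' or '-' per stage, with the include count for the entity view."""
    status = {}
    for stage, pattern in STAGE_FILES.items():
        path = output_dir / pattern.format(chapter=chapter)
        if not path.exists():
            status[stage] = "-"
        elif stage == "Entities":
            # The count is the number of transclusion directives in the view.
            count = len(INCLUDE.findall(path.read_text(encoding="utf-8")))
            status[stage] = f"done ({count})"
        else:
            status[stage] = "done"
    return status


# Demonstration with a throwaway output directory holding one chapter view.
out = pathlib.Path(tempfile.mkdtemp())
(out / "entities").mkdir()
(out / "entities" / "book-1-chapter-05-entities.md").write_text(
    '{{ include "exchange.md" }}\n', encoding="utf-8"
)
status = chapter_status(out, "book-1-chapter-05")
print(status)
```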
### Assessing metrics

After processing a batch of chapters, run the metrics assessment:

```bash
python process_chapters.py --metrics --provider openrouter --no-commit
```

This concatenates all completed analyses and asks the LLM to evaluate
coverage, consistency, and completeness.

### Dependency statistics

```bash
python process_chapters.py --stats
```

---

## 7. The Artifact Database (`infospace.db`)

The pipeline stores all artifacts (source text, templates, guidelines, generated
outputs) and their dependency edges in a local SQLite database —
`infospace.db`. This file is **not checked into git** because it is a derived
cache that can be regenerated deterministically from the files already in the
repository.

### Why it is excluded

- **Binary format** — SQLite databases don't produce meaningful diffs and
  would bloat the git history with every pipeline run.
- **Fully derived** — every piece of data in the database originates from
  markdown files that *are* tracked in git (sources, templates, schemas,
  guidelines, and generated output).
- **Reproducible** — re-running the pipeline rebuilds the database from
  scratch without any LLM calls, because each stage checks for existing
  output files on disk before invoking the LLM.

### How to regenerate it

If `infospace.db` is missing (e.g. after a fresh clone), rebuild it by
re-running the pipeline over the chapters that already have output on disk:

```bash
# Regenerate the database from existing output files (no LLM calls needed):
python process_chapters.py --all --no-commit
```

This will:

1. Create a fresh `infospace.db`
2. Load all static artifacts (templates, guidelines, VSM reference)
3. For each chapter whose output files already exist, import them into the
   database and record dependency edges
4. Skip LLM calls entirely — existing files are detected and reused

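The reason no LLM calls are needed is the check-before-call pattern in each stage. The sketch below shows that pattern in isolation with a stub in place of a real adapter; `run_stage` and `fake_llm` are hypothetical names, and the real stage code additionally stores the artifact and records dependency edges.

```python
import pathlib
import tempfile
from typing import Callable


def run_stage(output_path: pathlib.Path, prompt: str, call_llm: Callable[[str], str]) -> str:
    """Return the stage output, calling the LLM only when no file exists yet."""
    if output_path.exists():
        return output_path.read_text(encoding="utf-8")  # reuse: no LLM call
    result = call_llm(prompt)
    output_path.write_text(result, encoding="utf-8")
    return result


calls = []


def fake_llm(prompt: str) -> str:
    """Stub adapter that records each call instead of contacting a provider."""
    calls.append(prompt)
    return "generated analysis"


target = pathlib.Path(tempfile.mkdtemp()) / "book-1-chapter-01-analysis.md"
first = run_stage(target, "compiled prompt", fake_llm)   # file missing: LLM is called
second = run_stage(target, "compiled prompt", fake_llm)  # file present: LLM is skipped
```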
After regeneration, `--list` and `--stats` work as normal:

```bash
python process_chapters.py --list
python process_chapters.py --stats
```

---

## 8. How the LLM Integration Works

The pipeline uses MarkiTect's `markitect.llm` module, which provides three
adapter backends that implement the `LLMAdapter` interface:

| Backend | How it works | Pros | Cons |
|---------|-------------|------|------|
| `openrouter` | HTTP POST to OpenRouter API | Fast, real token counts, any model | Needs API key |
| `claude-code` | Shells out to `claude --print` | No API key needed if CLI installed | Slower, estimated token counts |
| `gemini` | HTTP POST to Google Generative Language API | Free tier available, real token counts | Rate limits on free tier |

### API key setup (OpenRouter)

Place your key in one of these locations (checked in order):

1. Pass `--api-key` on the command line (not yet implemented in the CLI)
2. Set `OPENROUTER_API_KEY` environment variable
3. Create `apikey-openrouter.txt` in the project root (git-ignored)

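The environment-then-file lookup in items 2 and 3 can be sketched as follows. `load_api_key` is a hypothetical helper written for this tutorial, not the function `markitect.llm` actually uses, and it leaves out the unimplemented `--api-key` option.

```python
import os
import pathlib
from typing import Optional


def load_api_key(env_var: str, key_file: pathlib.Path) -> Optional[str]:
    """Return the key from the environment, else from the file, else None."""
    value = os.environ.get(env_var, "").strip()
    if value:
        return value
    if key_file.is_file():
        # An empty file counts as "no key" rather than an empty string.
        return key_file.read_text(encoding="utf-8").strip() or None
    return None


key = load_api_key("OPENROUTER_API_KEY", pathlib.Path("apikey-openrouter.txt"))
print("key found" if key else "no key configured")
```

Called with `GEMINI_API_KEY` and `apikey-geminifree.txt`, the same helper would cover the Gemini lookup.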
### API key setup (Gemini)

Place your Google AI Studio key in one of these locations (checked in order):

1. Set `GEMINI_API_KEY` environment variable
2. Create `apikey-geminifree.txt` in the project root (git-ignored)

Get a free API key at https://aistudio.google.com/apikey. The free tier
supports `gemini-2.5-flash` with generous rate limits.

### What happens per stage

1. The pipeline **resolves** macro placeholders by looking up artifacts
   in the repository
2. It **compiles** the template into a complete prompt (macros replaced
   with real content)
3. It writes the compiled prompt to `output/<stage>/<chapter>-prompt.md`
   for inspection
4. If no output exists yet and an LLM adapter is configured, it
   **executes** the prompt
5. **For entity extraction (stage 1):** the pipeline first binds the
   list of already-existing entity slugs to `@{existing_entities}` so
   the LLM knows what to skip. The LLM returns combined content with
   `--- ENTITY: <name> ---` delimiters. The pipeline splits this into
   the **flat canonical directory** (`output/entities/<slug>.md`),
   skipping any entity whose slug already exists. It then generates the
   chapter view file with transclusion directives. The combined content
   is never persisted as a single file — canonical entity files are the
   source of truth.
6. **For other stages:** the result is written directly to its output file
7. The output is **stored** as a generated artifact in the repository
8. Dependency edges are **recorded** in the graph

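The splitting in step 5 is straightforward to picture. The sketch below separates combined LLM output on the `--- ENTITY: <name> ---` delimiter lines; the function name `split_entities` and the sample text are made up for illustration, and the real pipeline goes on to slug each name and skip existing files.

```python
import re

DELIMITER = re.compile(r"^--- ENTITY: (.+?) ---\s*$", flags=re.MULTILINE)


def split_entities(llm_output: str) -> dict[str, str]:
    """Map each entity name to the markdown that follows its delimiter line."""
    parts = DELIMITER.split(llm_output)
    # parts = [text before first delimiter, name1, body1, name2, body2, ...]
    names, bodies = parts[1::2], parts[2::2]
    return {name.strip(): body.strip() + "\n" for name, body in zip(names, bodies)}


# Illustrative input only; real output carries full schema sections.
combined = (
    "--- ENTITY: Real Price ---\n# Real Price\n\n## Definition\n...\n"
    "--- ENTITY: Nominal Price ---\n# Nominal Price\n\n## Definition\n...\n"
)
entities = split_entities(combined)
print(list(entities))
```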
---

## 9. Tracking History with Git

Every processed chapter produces a git commit containing:

- Compiled prompts (`*-prompt.md`) — so you can audit exactly what was sent
- Canonical entity files (`output/entities/<slug>.md`) — one file per entity,
  shared across chapters, first occurrence wins
- Chapter entity views (`<chapter>-entities.md`) — transclusions of the
  canonical entities relevant to each chapter
- Generated outputs (`*-mappings.md`, `*-analysis.md`)

This means:

- `git log` shows the chronological order of processing
- `git diff` between commits shows what each chapter contributed
- You can `git bisect` to find where quality degraded
- You can revert a chapter and re-process it with different settings

To let the script auto-commit (default):

```bash
python process_chapters.py --chapter book-1-chapter-05 --provider openrouter
```

To commit manually after reviewing:

```bash
python process_chapters.py --chapter book-1-chapter-05 --provider openrouter --no-commit
# review new entity files in output/entities/ (look for recently modified .md files)
# review chapter view in output/entities/book-1-chapter-05-entities.md
git add examples/infospace-with-history/output/
git commit -m "infospace: process book-1-chapter-05"
```

---

## 10. Cost and Performance

From our measurements processing chapters 3-5:

| | Claude Code CLI | OpenRouter | Gemini (free) |
|---|---|---|---|
| Time per chapter | ~5 minutes | ~2 minutes | ~45 seconds |
| Token counts | Estimated (4 chars/tok) | Real (from API) | Real (from API) |
| Cost per chapter | ~$0.35 est. | ~$0.07 est. | Free |
| Default model | claude-sonnet-4 | anthropic/claude-sonnet-4 | gemini-2.5-flash |

**Projected cost for all 35 chapters via OpenRouter:** ~$2.50
(varies by chapter length; Book V chapters are longer).

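The projection is a straight multiplication of the per-chapter figures in the table. The few lines below reproduce it, assuming chapters are processed one after another and that the chapters 3-5 averages hold for the rest of the book.

```python
CHAPTERS = 35
COST_PER_CHAPTER_USD = 0.07  # OpenRouter estimate from the table above
MINUTES_PER_CHAPTER = 2      # OpenRouter timing from the table above

total_cost = CHAPTERS * COST_PER_CHAPTER_USD
total_minutes = CHAPTERS * MINUTES_PER_CHAPTER
print(f"~${total_cost:.2f} and ~{total_minutes} minutes for {CHAPTERS} chapters")
```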
**Gemini free tier** is a zero-cost option for experimentation and
development. The `gemini-2.5-flash` model produces good results.
Rate limits apply — the pipeline retries automatically on 429 responses.

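The tutorial does not spell out the retry policy, so the following is only a sketch of one common approach, exponential backoff on HTTP 429. The name `call_with_retry`, the delays, and the attempt limit are all assumptions and may not match what the Gemini adapter does.

```python
import time
from typing import Callable, Tuple


def call_with_retry(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call send() until it returns a non-429 status, doubling the wait each time."""
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # waits of 1s, 2s, 4s, ...
    raise RuntimeError(f"still rate-limited after {max_attempts} attempts")


# Demonstration with a stub that is rate-limited twice, then succeeds.
responses = iter([(429, ""), (429, ""), (200, "ok")])
waits = []
result = call_with_retry(lambda: next(responses), sleep=waits.append)
```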
To reduce costs further, use a cheaper model:

```bash
python process_chapters.py --all --provider openrouter --model anthropic/claude-haiku-4-5-20251001 --no-commit
python process_chapters.py --all --provider gemini --model gemini-2.5-flash-lite --no-commit
```

---

## 11. Completing the Remaining Chapters

As of now, 5 of 35 chapters are processed (Book I, Chapters 1-5). Here is
how to complete the rest.

### Step-by-step

**1. Process remaining Book I chapters (6-11):**

```bash
python process_chapters.py --book 1 --provider openrouter --no-commit
```

Already-processed chapters are skipped (their chapter view files exist).
Entities from earlier chapters are automatically shared — the LLM is
told which entities already exist and avoids re-extracting them.

**2. Process Books II-V:**

```bash
python process_chapters.py --book 2 --provider openrouter --no-commit
python process_chapters.py --book 3 --provider openrouter --no-commit
python process_chapters.py --book 4 --provider openrouter --no-commit
python process_chapters.py --book 5 --provider openrouter --no-commit
```

Or all at once:

```bash
python process_chapters.py --all --provider openrouter --no-commit
```

**3. Run metrics after each book (or at the end):**

```bash
python process_chapters.py --metrics --provider openrouter --no-commit
```

**4. Commit the results:**

```bash
git add examples/infospace-with-history/output/
git commit -m "infospace: process all remaining chapters"
```

**5. Review the metrics report:**

Open `output/metrics/metrics-report.md`. It will show:

- Which VSM concepts (S1-S5, recursion, variety, etc.) now have mappings
- Total entity and mapping counts
- Consistency scores
- Recommendations for gaps

### Expected progression

| After | Chapters | Expected coverage |
|-------|----------|-------------------|
| Book I (11 ch.) | 11/35 | S1, S2, S4 strong; S3 emerging |
| Books I-II (16 ch.) | 16/35 | S3 (capital control) covered |
| Books I-III (20 ch.) | 20/35 | Historical patterns add depth |
| Books I-IV (30 ch.) | 30/35 | S5 (policy, mercantilism) emerging |
| All (35 ch.) | 35/35 | Full coverage, S3* and algedonic signals likely from Book V |

Book V (public revenue, taxation, sovereign duties) is expected to
fill the remaining gaps in S3*, S5, and regulatory concepts.

---

## 12. Quality Improvement Loop

The infospace is designed to be **iteratively refined**:

1. **Process chapters** — run the pipeline
2. **Assess metrics** — identify gaps in VSM coverage and consistency
3. **Refine guidelines** — update `extraction-rules.md` or
   `mapping-rules.md` to address identified weaknesses
4. **Re-process** — delete output files for specific chapters and re-run
   with improved guidelines
5. **Compare** — use git diff to see how the refined guidelines changed
   the output

Example: if metrics show that S3* (Audit) is consistently missed, you
could add a paragraph to `extraction-rules.md` explicitly asking the LLM
to look for audit, inspection, and oversight mechanisms.

To re-process a specific chapter, remove its chapter view and downstream
outputs. Note: canonical entity files in `output/entities/` are shared
across chapters — only delete individual entity files if you want them
re-extracted from scratch.

```bash
rm -f examples/infospace-with-history/output/entities/book-1-chapter-03-entities.md
rm -f examples/infospace-with-history/output/mappings/book-1-chapter-03-mappings.md
rm -f examples/infospace-with-history/output/analyses/book-1-chapter-03-analysis.md
python process_chapters.py --chapter book-1-chapter-03 --provider openrouter --no-commit
```

**Important**: never silently delete canonical entity files. If an entity
is no longer needed, **archive** it with a documented reason:

```bash
# Entity found to be redundant — archive it
python process_chapters.py --archive-entity extent-of-the-market \
    --reason "Subsumed by market-price and effectual-demand — the concept is fully covered by these two entities"

# Then re-process the chapter
python process_chapters.py --chapter book-1-chapter-03 --provider openrouter --no-commit
```

If you genuinely need to re-extract an entity with different content
(e.g., improving its definition), archive the old version first, then
delete the archive copy only after confirming the new version is better.
The archive in `output/entities/archive/` preserves the full intellectual
history of the infospace — every refinement decision is traceable.

---

## 13. Infrastructure Issues Found and Fixed

During development we documented three issues with the MarkiTect
infrastructure in `INFRA-TASKS.md`:

1. **Artifact repo doesn't store content** — the resolver returned
   placeholder text instead of actual artifact content.
2. **ContentMacro `raw_text` defaults to `""`** — caused silent data
   corruption when macros were constructed programmatically.
3. **No `@{target}` syntax in MacroParser** — macros had to be
   constructed manually rather than auto-detected from template text.

All three have been fixed in the MarkiTect infrastructure. The pipeline
script (`process_chapters.py`) has been refactored to use the fixed
infrastructure directly — the local content cache, manual macro
construction, and manual substitution workarounds have been removed.
See `INFRA-TASKS.md` for details on each fix.

---

## 14. Adapting This Pattern to Your Own Project

To build your own infospace using this pattern:

1. **Choose your source corpus** — any collection of documents you want
   to transform into structured knowledge.
2. **Define your target ontology** — what concepts, relationships, or
   categories you want to extract (our VSM is just one example).
3. **Write schemas** — markdown documents defining the required sections
   and validation rules for each output type.
4. **Write extraction guidelines** — rules that tell the LLM what to
   look for and how to handle edge cases.
5. **Create prompt templates** — use `@{macro}` syntax to inject source
   text and guidelines at compile time.
6. **Build your pipeline** — follow `process_chapters.py` as a reference
   for loading artifacts, resolving templates, and calling the LLM.
7. **Process incrementally** — work through your corpus one document at a
   time, tracking everything in git.
8. **Measure and refine** — define metrics, assess them periodically,
   and update your guidelines when gaps appear.

The key architectural insight is that **schemas and guidelines are
artifacts** — they live in the same repository as your source text and
can be versioned, diffed, and refined just like code.