diff --git a/examples/infospace-with-history/TUTORIAL.md b/examples/infospace-with-history/TUTORIAL.md
index 1c63fe6c..5dd08301 100644
--- a/examples/infospace-with-history/TUTORIAL.md
+++ b/examples/infospace-with-history/TUTORIAL.md
@@ -1,36 +1,42 @@
 # Building an Infospace with History — Tutorial
 
-This tutorial walks through how we built a structured **information space**
-(infospace) from Adam Smith's *The Wealth of Nations*, mapping classical
-economic concepts to Stafford Beer's **Viable System Model** (VSM), using
-MarkiTect's prompt dependency resolution and LLM integration.
+This tutorial walks through how to build a structured **infospace** from
+Adam Smith's *The Wealth of Nations*, mapping classical economic concepts
+to Stafford Beer's **Viable System Model** (VSM), using MarkiTect's
+infospace tooling.
 
 By the end you will understand how to:
 
-1. Design schemas that scaffold structured LLM output
-2. Write prompt templates with dependency injection (`@{macro}` syntax)
-3. Populate source artifacts and reference material
+1. Declare an infospace with `infospace.yaml` and `markitect infospace init`
+2. Design schemas that scaffold structured LLM output
+3. Write prompt templates with dependency injection (`@{macro}` syntax)
 4. Run an incremental, chapter-by-chapter pipeline
-5. Track every change through git history
-6. Measure completeness and consistency with metrics
-7. Continue the work to process remaining chapters
+5. Evaluate entity quality and run collection-level checks
+6. Review viability against declared thresholds
+7. Track every change through git history
+8. Use a completed infospace as a discipline for a new project
 
 ---
 
-## 1. The Idea
+## 1. What Is an Infospace?
 
-We want to transform a large body of text — the full public-domain text of
-*The Wealth of Nations* (5 books, 35 chapters) — into a **curated
-collection of economic concepts and entities**, each mapped to the VSM.
+An **infospace** is a curated, self-describing collection of **entities**
+(concepts, mechanisms, observations) that together explain a **topic**
+through the lens of one or more **disciplines**.
 
-The challenge: this is too much for a single prompt. The text is hundreds of
-thousands of words. We need to work **incrementally**, one chapter at a time,
-building up the infospace and tracking progress.
+| Term | This example |
+|---|---|
+| Topic | *The Wealth of Nations* (Smith, 1776) |
+| Discipline | Viable System Model (Beer) |
+| Entities | Economic concepts: division of labour, natural price, … |
+| Viability | Does the entity set answer the competency questions? |
 
-MarkiTect's prompt dependency resolution lets us define **templates** with
-`@{placeholder}` macros that are filled from an artifact repository at
-execution time. The pipeline compiles each template into a complete prompt,
-sends it to an LLM, and stores the output — all tracked by git.
+The challenge with a large source corpus is that it is too big for a single
+prompt. MarkiTect processes it **incrementally**, one chapter at a time,
+building up the entity set and tracking progress through git.
+
+An infospace is **viable** when it meets threshold scores across defined
+metrics — it is fit for purpose as an explanatory tool.
 
 ---
 
@@ -39,10 +45,11 @@ sends it to an LLM, and stores the output — all tracked by git.
 ```
 examples/infospace-with-history/
 │
-├── README.md # Project brief
+├── infospace.yaml # Declarative infospace configuration (NEW)
+├── README.md
 ├── TUTORIAL.md # This file
 ├── INFRA-TASKS.md # Infrastructure issues found during the experiment
-├── process_chapters.py # Pipeline script
+├── process_chapters.py # Pipeline script (chapter processing)
 ├── infospace.db # SQLite artifact database (generated, not in git)
 │
 ├── schemas/ # Output structure definitions
@@ -66,68 +73,113 @@ examples/infospace-with-history/
     ├── entities/ # Flat canonical entity set + chapter views
     │   ├── division-of-labour.md # Canonical entity file (PRIMARY)
     │   ├── exchange.md
-    │   ├── commercial-society.md
-    │   ├── ...
     │   ├── book-1-chapter-01-entities.md # Chapter view (transclusion)
-    │   ├── book-1-chapter-01-prompt.md # Compiled prompt
-    │   ├── book-1-chapter-04-entities.md # Also references division-of-labour.md
     │   └── ...
     ├── mappings/ # Per-chapter VSM mappings
     ├── analyses/ # Per-chapter synthesised analyses
-    └── metrics/ # Cross-chapter metrics reports
+    └── metrics/ # Collection metrics + history
+        ├── metrics.yaml # Latest metric values
+        └── history.yaml # Timestamped snapshot log
 ```
 
 **Entity organisation**: The infospace maintains a **flat canonical set**
-of entities — one markdown file per entity, stored directly in
-`output/entities/`. When a chapter mentions an entity that already exists
-(detected by slug collision), the duplicate is skipped and the original
-definition is kept. This builds a **minimal necessary and sufficient set**
-of entities across the entire book.
-
-Per-chapter `*-entities.md` files are **secondary views** that use
-MarkiTect's transclusion engine (`{{ include "entity.md" }}`) to compose
-entity content by reference. The same entity (e.g., `division-of-labour.md`)
-can appear in multiple chapter views. Editing a canonical entity file
-automatically updates every chapter view that references it.
-
-**Deduplication**: The pipeline tells the LLM which entities already exist
-(via the `@{existing_entities}` macro in the extraction template) so it
-focuses on genuinely new entities. At the file level, slug collisions
-are detected and skipped as a safety net.
-
-**Entity lifecycle**: Once an entity enters the canonical set, it is
-**never silently deleted**. Entities may only be retired when they have been
-subsumed by another entity, found to partially map onto other entities, or
-otherwise determined to be redundant. Retired entities are **archived** —
-moved to `output/entities/archive/` with a dated header documenting the
-reason. This preserves the intellectual history of the infospace: every
-decision to drop an entity is a deliberate, documented learning.
-
-```bash
-# Archive an entity that has been subsumed by another
-python process_chapters.py --archive-entity enlarged-monopoly \
-  --reason "Subsumed by monopoly-price — both describe the same market distortion"
-
-# The archived file retains its full content with an explanatory header
-cat output/entities/archive/enlarged-monopoly.md
-```
-
-The `--list` command shows both the active canonical set and the archive count.
+of entities — one markdown file per entity in `output/entities/`. Duplicate
+slugs across chapters are skipped (first occurrence wins). Per-chapter
+`*-entities.md` files are **secondary views** using transclusion directives
+(`{{ include "entity.md" }}`), so editing a canonical file updates every
+chapter view that references it.
 
 ---
 
-## 3. Designing Schemas
+## 3. Initialising an Infospace
 
-Before writing any prompts we defined **four schemas** that tell the LLM
-exactly what sections each output document must contain. This ensures every
-generated document is machine-parseable and comparable across chapters.
+### Starting fresh
+
+Use `markitect infospace init` to create an `infospace.yaml`:
+
+```bash
+cd my-new-infospace/
+markitect infospace init \
+  --topic "The Wealth of Nations" \
+  --domain "Classical Economics" \
+  --sources artifacts/sources/ \
+  --discipline "Viable System Model"
+```
+
+This creates a minimal `infospace.yaml`. Edit it to add schemas,
+competency questions, and viability thresholds:
+
+```yaml
+topic:
+  name: "The Wealth of Nations"
+  domain: "Classical Economics"
+  sources: artifacts/sources/
+
+disciplines:
+  - name: "Viable System Model"
+    path: artifacts/vsm-reference/
+
+schemas:
+  entity: schemas/economic-entity-schema-v1.0.md
+  mapping: schemas/vsm-mapping-schema-v1.0.md
+  analysis: schemas/chapter-analysis-schema-v1.0.md
+
+competency_questions: |
+  1. How does Smith's division of labour map to VSM System 1 operations?
+  2. What mechanisms in WoN correspond to VSM coordination (System 2)?
+  3. Where does Smith describe self-organising regulation (System 3)?
+  4. What role does the "invisible hand" play as a System 4 mechanism?
+  5. How do Smith's views on government map to System 5 policy?
+  6. Is the WoN entity set viable as an explanatory framework?
+
+viability:
+  redundancy_ratio: { max: 0.10 }
+  coverage_ratio: { min: 0.50 }
+  coherence_components: { max: 3 }
+  consistency_cycles: { max: 0 }
+  granularity_entropy: { min: 1.0 }
+
+pipeline:
+  stages:
+    - name: extract-entities
+      template: templates/extract-entities.md
+    - name: map-to-vsm
+      template: templates/map-to-vsm.md
+    - name: synthesize-analysis
+      template: templates/synthesize-analysis.md
+```
+
+### Checking status
+
+At any point, inspect the infospace:
+
+```bash
+markitect infospace status
+# Infospace: The Wealth of Nations
+# Domain: Classical Economics
+# Entities: 109
+# Domains: Production, Distribution, Exchange, Regulation
+# Disciplines: Viable System Model
+# Chapters: 9/35 processed
+
+markitect infospace entities
+# Lists all entities with domain, source chapter, word count
+```
+
+---
+
+## 4. Designing Schemas
+
+Before writing any prompts, define **schemas** — markdown documents that
+tell the LLM exactly what sections each output must contain. Schemas are
+not code; the LLM reads them as instructions.
 
 ### Economic Entity Schema (`schemas/economic-entity-schema-v1.0.md`)
 
 Every extracted entity must have:
 
-- **H1 heading** with the entity name
-- **Definition** (20-150 words)
+- **H1 heading** with the entity name (title case)
+- **Definition** (20–150 words, precise and non-circular)
 - **Source Chapter** citing Book and Chapter
 - **Context** — where in Smith's argument the entity appears
 - **Economic Domain** (Production, Distribution, Exchange, etc.)
@@ -138,7 +190,7 @@ Optional: Smith's Original Wording, Modern Interpretation.
 
 Every entity-to-VSM mapping must have:
 
-- **H1 heading** in the format `Entity Name -> VSM Concept Name`
+- **H1 heading**: `Entity Name -> VSM Concept Name`
 - **Economic Entity Reference** and **VSM Concept Reference**
 - **Mapping Rationale** (minimum 30 words, grounded in Beer's definitions)
 - **Mapping Strength**: Strong, Moderate, or Weak
@@ -147,32 +199,22 @@ Every entity-to-VSM mapping must have:
 
 The per-chapter synthesis includes:
 
-- **Chapter Summary** (50-300 words)
+- **Chapter Summary** (50–300 words)
 - **Entities Extracted** — bulleted list
 - **VSM Mappings** — entity, concept, strength
 - **VSM Coverage** — explicit assessment of S1 through S5 and S3*
 - **Gaps & Observations**
 
-### Metrics Schema (implicit in `assess-metrics` template)
-
-The metrics report computes:
-
-- VSM Concept Coverage (% of S1-S5, recursion, variety, etc.)
-- Chapter Coverage (% of 35 chapters processed)
-- Entity and Mapping counts
-- Terminology Consistency and Cross-reference Integrity scores
-
-**Key insight**: Schemas are not code — they are markdown documents that
-the LLM reads as instructions. This means you can iterate on them without
-changing any infrastructure.
+**Key insight**: Schemas are artifacts — they live in the repository and
+can be versioned, diffed, and refined just like code. Improving a schema
+and re-processing a chapter is visible as a git diff.
 
 ---
 
-## 4. Writing Prompt Templates
+## 5. Writing Prompt Templates
 
-Each template is a markdown file containing instructions for the LLM plus
-`@{macro_name}` placeholders that MarkiTect's resolver fills with artifact
-content at compile time.
+Each template is a markdown file with `@{macro_name}` placeholders that
+MarkiTect's resolver fills with artifact content at compile time.
 
 ### Template 1: Extract Entities (`templates/extract-entities.md`)
 
@@ -184,50 +226,36 @@ Your task is to extract distinct economic entities from a chapter of
 Adam Smith's *The Wealth of Nations*.
 
 ## Source Chapter
-
 @{chapter_text}
 
 ## Extraction Guidelines
-
 @{extraction_rules}
 
 ## VSM Framework Context
-
 @{vsm_framework}
 
 ## Existing Entities
-
 @{existing_entities}
 
-## Instructions
-[... detailed step-by-step instructions ...]
-
 ## Output Format
-
-Output each entity as a separate markdown document, delimited by
-`--- ENTITY: ---` markers.
+Output each entity delimited by `--- ENTITY: ---` markers.
 ```
 
-The four macros (`chapter_text`, `extraction_rules`, `vsm_framework`,
-`existing_entities`) are resolved by looking up artifacts by name in
-the relevant information spaces. The `existing_entities` list is
-dynamically generated at runtime from the canonical entity files
-already on disk, enabling incremental extraction without duplication.
+The `@{existing_entities}` macro is generated at runtime from canonical
+files already on disk, enabling incremental extraction without duplication.
 
 ### Template 2: Map to VSM (`templates/map-to-vsm.md`)
 
-Takes `@{entities}` (output from stage 1), `@{vsm_framework}`, and
-`@{mapping_rules}` as inputs.
+Inputs: `@{entities}`, `@{vsm_framework}`, `@{mapping_rules}`.
 
 ### Template 3: Synthesise Analysis (`templates/synthesize-analysis.md`)
 
-Takes `@{chapter_text}`, `@{entities}` (stage 1 output),
-`@{mappings}` (stage 2 output), and `@{vsm_framework}`.
+Inputs: `@{chapter_text}`, `@{entities}`, `@{mappings}`, `@{vsm_framework}`.
 
 ### Template 4: Assess Metrics (`templates/assess-metrics.md`)
 
-Takes `@{all_analyses}` (concatenation of all chapter analyses) and
-`@{vsm_framework}`. Runs across the entire infospace, not per-chapter.
+Inputs: `@{all_analyses}` (all chapter analyses concatenated), `@{vsm_framework}`.
+Runs across the entire infospace, not per-chapter.
 
 **Dependency chain per chapter:**
 
@@ -247,95 +275,63 @@ vsm_framework ────┤
                   ▼
               analysis
 ```
-After all chapters are processed, `assess-metrics` evaluates the
-complete infospace.
-
 ---
 
-## 5. Populating Artifacts
+## 6. Populating Artifacts
 
 ### Source chapters (`artifacts/sources/`)
 
-35 markdown files containing the full public-domain text of each chapter.
-Named by convention: `book-1-chapter-01.md` through `book-5-chapter-03.md`,
-plus `introduction.md`.
-
-These are loaded into the `infospace-sources` information space.
+35 markdown files with the full public-domain text of each chapter.
+Named `book-1-chapter-01.md` through `book-5-chapter-03.md`.
 
 ### Guidelines (`artifacts/guidelines/`)
 
-Two hand-written reference documents:
-
-- **`extraction-rules.md`** — What constitutes an entity, granularity rules,
-  naming conventions, quality checks.
+- **`extraction-rules.md`** — What constitutes an entity, granularity
+  rules, naming conventions.
 - **`mapping-rules.md`** — How to map entities to VSM systems, what
-  constitutes strong/moderate/weak mapping strength.
-
-These are loaded into `infospace-guidelines`.
+  constitutes Strong/Moderate/Weak strength.
 
 ### VSM reference (`artifacts/vsm-reference/`)
 
-- **`vsm-framework.md`** — Complete description of Beer's VSM (S1-S5, S3*,
-  recursion, variety, viability, attenuation/amplification, algedonic
-  signals, autonomy). Includes economic interpretations for each system.
-
-Loaded into `infospace-vsm-reference`.
+- **`vsm-framework.md`** — Complete description of Beer's VSM (S1–S5,
+  S3*, recursion, variety, viability, algedonic signals, autonomy) with
+  economic interpretations.
 
 ---
 
-## 6. The Pipeline Script
+## 7. Processing Chapters
 
-`process_chapters.py` orchestrates everything. It:
+`process_chapters.py` orchestrates the three-stage pipeline. It initialises
+the artifact repository, loads static artifacts, runs entity extraction →
+VSM mapping → analysis synthesis, and commits each chapter to git.
 
-1. Initialises the artifact repository (SQLite) and information spaces
-2. Loads all static artifacts (templates, guidelines, VSM reference)
-3. For each chapter, runs the three-stage pipeline
-4. Optionally calls an LLM to auto-generate outputs
-5. Records dependency edges in the graph
-6. Commits results to git
-
-### Running a single chapter
+### Single chapter
 
 ```bash
-# Manual mode (writes prompts, waits for you to provide output files):
+# Manual mode (writes prompts, awaits output files):
 python process_chapters.py --chapter book-1-chapter-05 --no-commit
 
-# Automatic mode via OpenRouter (recommended — fast, real token counts):
-python process_chapters.py --chapter book-1-chapter-05 --provider openrouter --no-commit
+# Auto mode via OpenRouter (free models available):
+python process_chapters.py --chapter book-1-chapter-05 --provider openrouter
 
-# Automatic mode via Gemini free tier:
-python process_chapters.py --chapter book-1-chapter-05 --provider gemini --no-commit
-
-# Automatic mode via Claude Code CLI:
-python process_chapters.py --chapter book-1-chapter-05 --provider claude-code --no-commit
-
-# With a specific model:
-python process_chapters.py --chapter book-1-chapter-05 --provider openrouter --model anthropic/claude-haiku-4-5-20251001 --no-commit
-python process_chapters.py --chapter book-1-chapter-05 --provider gemini --model gemini-2.5-flash-lite --no-commit
+# With a specific free model:
+python process_chapters.py --chapter book-1-chapter-05 \
+  --provider openrouter --model meta-llama/llama-4-maverick:free
 ```
 
-### Running a whole book
+### Whole book or all chapters
 
 ```bash
-python process_chapters.py --book 1 --provider openrouter --no-commit
+python process_chapters.py --book 1 --provider openrouter
+python process_chapters.py --all --provider openrouter
 ```
 
-### Running all chapters
-
-```bash
-python process_chapters.py --all --provider openrouter --no-commit
-```
-
-### Checking progress
+### Check progress
 
 ```bash
 python process_chapters.py --list
 ```
 
-Prints a table showing which chapters have completed each stage
-(entity counts reflect the chapter view's transclusion references,
-including shared entities from earlier chapters):
-
 ```
 Available chapters (35):
 
@@ -343,142 +339,141 @@ Available chapters (35):
   ------------------------------ ------------ ------------ ------------
   book-1-chapter-01              done (13)    done         done
   book-1-chapter-02              done (7)     done         done
-  book-1-chapter-03              done (18)    done         done
-  book-1-chapter-04              done (5)     done         done
-  book-1-chapter-05              done (1)     done         done
   ...
 
-  Canonical entity set: 42 unique entities
+  Canonical entity set: 109 unique entities
 ```
 
-### Assessing metrics
+### Entity lifecycle
 
-After processing a batch of chapters, run the metrics assessment:
+Entities in the canonical set are **never silently deleted**. Retire
+an entity by archiving it with a documented reason:
 
 ```bash
-python process_chapters.py --metrics --provider openrouter --no-commit
+python process_chapters.py --archive-entity enlarged-monopoly \
+  --reason "Subsumed by monopoly-price — same market distortion"
 ```
 
-This concatenates all completed analyses and asks the LLM to evaluate
-coverage, consistency, and completeness.
+The archived file moves to `output/entities/archive/.md` with a
+dated header, preserving the intellectual history of every decision.
 
-### Dependency statistics
+---
+
+## 8. Evaluating Entity Quality
+
+Once chapters are processed, evaluate the entity set using the infospace
+tooling commands.
+
+### Per-entity evaluation
 
 ```bash
-python process_chapters.py --stats
+# Evaluate all entities (requires LLM provider):
+markitect infospace evaluate --provider openrouter
+
+# Evaluate entities from a specific chapter:
+markitect infospace evaluate --chapter book-1-chapter-05 --provider openrouter
+
+# Re-evaluate a single entity:
+markitect infospace evaluate --entity division-of-labour --provider openrouter
+```
+
+This runs the `evaluate-entity` prompt template against each entity,
+scoring dimensions like definition precision, source grounding, and
+VSM relevance. Results are written to `output/evaluations/`.
+
+### Collection-level checks (C1–C5)
+
+```bash
+# Run all five collection checks:
+markitect infospace check --provider openrouter
+
+# Run individual checks:
+markitect infospace check redundancy # C1: Are any entities synonymous?
+markitect infospace check coverage # C2: Which domain × VSM cells are empty?
+markitect infospace check coherence # C3: Is the entity graph well-connected?
+markitect infospace check consistency # C4: Are there circular definitions?
+markitect infospace check granularity # C5: Is abstraction level balanced?
+```
+
+Each check uses the platform's embedding, graph analysis, and FCA
+infrastructure. Results are written to `output/metrics/` and a new
+snapshot is appended to `metrics-history.yaml`.
+
+Sample output:
+
+```
+Running collection checks on 109 entities...
+
+  C1 — redundancy
+    redundancy_ratio: 0.0183
+    high_similarity_pairs: 2
+
+  C2 — coverage
+    coverage_ratio: 0.4286
+    empty_cells: [['Regulation', 'S3*'], ['Historical', 'S5']]
+
+  C3 — coherence
+    coherence_components: 1
+    modularity: 0.412
+
+  C4 — consistency
+    consistency_cycles: 0
+    grounding_ratio: 0.94
+
+  C5 — granularity
+    granularity_entropy: 2.69
 ```
 
 ---
 
-## 7. The Artifact Database (`infospace.db`)
-
-The pipeline stores all artifacts (source text, templates, guidelines, generated
-outputs) and their dependency edges in a local SQLite database —
-`infospace.db`. This file is **not checked into git** because it is a derived
-cache that can be regenerated deterministically from the files already in the
-repository.
-
-### Why it is excluded
-
-- **Binary format** — SQLite databases don't produce meaningful diffs and
-  would bloat the git history with every pipeline run.
-- **Fully derived** — every piece of data in the database originates from
-  markdown files that *are* tracked in git (sources, templates, schemas,
-  guidelines, and generated output).
-- **Reproducible** — re-running the pipeline rebuilds the database from
-  scratch without any LLM calls, because each stage checks for existing
-  output files on disk before invoking the LLM.
-
-### How to regenerate it
-
-If `infospace.db` is missing (e.g. after a fresh clone), rebuild it by
-re-running the pipeline over the chapters that already have output on disk:
+## 9. Reviewing Viability
 
 ```bash
-# Regenerate the database from existing output files (no LLM calls needed):
-python process_chapters.py --all --no-commit
+markitect infospace viability
 ```
 
-This will:
+Compares the latest metrics against the thresholds declared in
+`infospace.yaml`:
 
-1. Create a fresh `infospace.db`
-2. Load all static artifacts (templates, guidelines, VSM reference)
-3. For each chapter whose output files already exist, import them into the
-   database and record dependency edges
-4. Skip LLM calls entirely — existing files are detected and reused
+```
+Metric                   Value    Threshold    Status
+-----------------------------------------------------------
+redundancy_ratio         0.0183   max=0.10     PASS
+coverage_ratio           0.4286   min=0.50     FAIL
+coherence_components     1        max=3        PASS
+consistency_cycles       0        max=0        PASS
+granularity_entropy      2.6900   min=1.0      PASS
+
+Viable: NO (4/5 thresholds met)
+```
 
-After regeneration, `--list` and `--stats` work as normal:
+Coverage is currently failing (42% < 50% threshold) because only 9 of
+35 chapters have been processed. Once more chapters are done, coverage
+will rise.
+
+### Metrics history
 
 ```bash
-python process_chapters.py --list
-python process_chapters.py --stats
+markitect infospace history
+```
+
+Shows how metrics evolved across runs:
+
+```
+Snapshot   Date         Entities   coverage   redundancy   entropy
+-------------------------------------------------------------
+6ba48eb2   2026-02-19   85         0.361      0.000        2.687
 ```
 
 ---
 
-## 8. How the LLM Integration Works
+## 10. Tracking History with Git
 
-The pipeline uses MarkiTect's `markitect.llm` module, which provides three
-adapter backends that implement the `LLMAdapter` interface:
+Every processed chapter produces one git commit containing:
 
-| Backend | How it works | Pros | Cons |
-|---------|-------------|------|------|
-| `openrouter` | HTTP POST to OpenRouter API | Fast, real token counts, any model | Needs API key |
-| `claude-code` | Shells out to `claude --print` | No API key needed if CLI installed | Slower, estimated token counts |
-| `gemini` | HTTP POST to Google Generative Language API | Free tier available, real token counts | Rate limits on free tier |
-
-### API key setup (OpenRouter)
-
-Place your key in one of these locations (checked in order):
-
-1. Pass `--api-key` on the command line (not yet implemented in the CLI)
-2. Set `OPENROUTER_API_KEY` environment variable
-3. Create `apikey-openrouter.txt` in the project root (git-ignored)
-
-### API key setup (Gemini)
-
-Place your Google AI Studio key in one of these locations (checked in order):
-
-1. Set `GEMINI_API_KEY` environment variable
-2. Create `apikey-geminifree.txt` in the project root (git-ignored)
-
-Get a free API key at https://aistudio.google.com/apikey. The free tier
-supports `gemini-2.5-flash` with generous rate limits.
-
-### What happens per stage
-
-1. The pipeline **resolves** macro placeholders by looking up artifacts
-   in the repository
-2. It **compiles** the template into a complete prompt (macros replaced
-   with real content)
-3. It writes the compiled prompt to `output//-prompt.md`
-   for inspection
-4. If no output exists yet and an LLM adapter is configured, it
-   **executes** the prompt
-5. **For entity extraction (stage 1):** the pipeline first binds the
-   list of already-existing entity slugs to `@{existing_entities}` so
-   the LLM knows what to skip. The LLM returns combined content with
-   `--- ENTITY: ---` delimiters. The pipeline splits this into
-   the **flat canonical directory** (`output/entities/.md`),
-   skipping any entity whose slug already exists. It then generates the
-   chapter view file with transclusion directives. The combined content
-   is never persisted as a single file — canonical entity files are the
-   source of truth.
-6. **For other stages:** the result is written directly to its output file
-7. The output is **stored** as a generated artifact in the repository
-8. Dependency edges are **recorded** in the graph
-
----
-
-## 9. Tracking History with Git
-
-Every processed chapter produces a git commit containing:
-
-- Compiled prompts (`*-prompt.md`) — so you can audit exactly what was sent
-- Canonical entity files (`output/entities/.md`) — one file per entity,
-  shared across chapters, first occurrence wins
-- Chapter entity views (`-entities.md`) — transclusion into the
-  canonical entities relevant to each chapter
+- Compiled prompts (`*-prompt.md`) — audit what was sent to the LLM
+- Canonical entity files (`output/entities/.md`) — first occurrence wins
+- Chapter entity views (`-entities.md`) — transclusion references
 - Generated outputs (`*-mappings.md`, `*-analysis.md`)
 
 This means:
@@ -486,212 +481,196 @@ This means:
 - `git log` shows the chronological order of processing
 - `git diff` between commits shows what each chapter contributed
 - You can `git bisect` to find where quality degraded
-- You can revert a chapter and re-process it with different settings
+- You can revert a chapter and re-process with improved guidelines
 
-To let the script auto-commit (default):
-
-```bash
-python process_chapters.py --chapter book-1-chapter-05 --provider openrouter
-```
+The `clean-example-history` branch in this repository demonstrates the
+intended structure: each chapter is a single, self-contained commit.
+Use it as a reference for how the infospace grew step by step.
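The "first occurrence wins" rule for canonical entity files can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code in `process_chapters.py`: the marker shape `--- ENTITY: <name> ---`, the slug rule in `slugify`, and the helper names `split_entities` and `write_canonical` are all hypothetical.

```python
import re
from pathlib import Path

# Assumed marker shape: "--- ENTITY: <name> ---" on a line of its own.
ENTITY_MARKER = re.compile(r"^--- ENTITY:[ \t]*(.*?)[ \t]*---[ \t]*$", re.MULTILINE)


def slugify(name: str) -> str:
    """Lower-case the name and join alphanumeric runs with hyphens (assumed rule)."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def split_entities(llm_output: str) -> list[tuple[str, str]]:
    """Return (slug, body) pairs in the order they appear in the combined output."""
    # With one capture group, re.split yields [preamble, name1, body1, name2, body2, ...].
    parts = ENTITY_MARKER.split(llm_output)
    return [
        (slugify(name), body.strip() + "\n")
        for name, body in zip(parts[1::2], parts[2::2])
    ]


def write_canonical(pairs, entities_dir: Path) -> tuple[list[str], list[str]]:
    """Write new entity files and skip any slug that already exists on disk."""
    entities_dir.mkdir(parents=True, exist_ok=True)
    written, skipped = [], []
    for slug, body in pairs:
        target = entities_dir / f"{slug}.md"
        if target.exists():
            skipped.append(slug)  # first occurrence wins: keep the original file
            continue
        target.write_text(body, encoding="utf-8")
        written.append(slug)
    return written, skipped
```

Calling `write_canonical` a second time with a slug that is already on disk leaves the earlier file untouched and reports the slug as skipped, which is the behaviour the tutorial describes for entities that recur in later chapters.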
 
 To commit manually after reviewing:
 
 ```bash
 python process_chapters.py --chapter book-1-chapter-05 --provider openrouter --no-commit
-# review new entity files in output/entities/ (look for recently modified .md files)
-# review chapter view in output/entities/book-1-chapter-05-entities.md
+# review output/entities/ and output/mappings/
 git add examples/infospace-with-history/output/
 git commit -m "infospace: process book-1-chapter-05"
 ```
 
 ---
 
-## 10. Cost and Performance
+## 11. Cost and Performance
 
-From our measurements processing chapters 3-5:
-
-| | Claude Code CLI | OpenRouter | Gemini (free) |
+| | OpenRouter (free) | OpenRouter (paid) | Gemini (free) |
 |---|---|---|---|
-| Time per chapter | ~5 minutes | ~2 minutes | ~45 seconds |
-| Token counts | Estimated (4 chars/tok) | Real (from API) | Real (from API) |
-| Cost per chapter | ~$0.35 est. | ~$0.07 est. | Free |
-| Default model | claude-sonnet-4 | anthropic/claude-sonnet-4 | gemini-2.5-flash |
+| Time per chapter | ~5 min | ~2 min | ~45 sec |
+| Cost per chapter | $0.00 | ~$0.07 | $0.00 |
+| Default model | `arcee-ai/trinity-large-preview:free` | `anthropic/claude-sonnet-4` | `gemini-2.5-flash` |
+| Rate limits | ~200 req/day | High | Per-minute |
 
-**Projected cost for all 35 chapters via OpenRouter:** ~$2.50
-(varies by chapter length; Book V chapters are longer).
-
-**Gemini free tier** is a zero-cost option for experimentation and
-development. The `gemini-2.5-flash` model produces good results.
-Rate limits apply — the pipeline retries automatically on 429 responses.
-
-To reduce costs further, use a cheaper model:
+**OpenRouter free tier**: Sign up at [openrouter.ai](https://openrouter.ai)
+(no credit card required). Store your key in `apikey-openrouter.txt` in the
+project root (git-ignored), or set `OPENROUTER_API_KEY`.
 
 ```bash
---provider openrouter --model anthropic/claude-haiku-4-5-20251001
---provider gemini --model gemini-2.5-flash-lite
+export OPENROUTER_API_KEY=$(cat apikey-openrouter.txt | tr -d '[:space:]')
 ```
 
+Use `openrouter/free` to automatically select from whichever free model is
+available:
+
+```bash
+python process_chapters.py --chapter book-1-chapter-05 \
+  --provider openrouter --model openrouter/free
+```
+
+**Gemini free tier**: Get a key at [aistudio.google.com/apikey](https://aistudio.google.com/apikey),
+store in `apikey-geminifree.txt`.
+
+Note: The `claude-code` provider (Claude CLI subprocess) is not available
+when running inside a Claude Code session due to nested session restrictions.
+
 ---
 
-## 11. Completing the Remaining Chapters
+## 12. Completing the Remaining Chapters
 
-As of now, 5 of 35 chapters are processed (Book I, Chapters 1-5). Here is
-how to complete the rest.
+As of writing, 9 of 35 chapters are processed (Book I, Chapters 1–9).
 
-### Step-by-step
-
-**1. Process remaining Book I chapters (6-11):**
+**Process Book I remainder:**
 
 ```bash
-python process_chapters.py --book 1 --provider openrouter --no-commit
+export OPENROUTER_API_KEY=$(cat apikey-openrouter.txt | tr -d '[:space:]')
+git checkout clean-example-history
+python process_chapters.py --book 1 --provider openrouter
 ```
 
-Already-processed chapters are skipped (their chapter view files exist).
-Entities from earlier chapters are automatically shared — the LLM is
-told which entities already exist and avoids re-extracting them.
+Already-processed chapters are skipped — their chapter view files exist.
+The `@{existing_entities}` macro ensures the LLM only extracts genuinely
+new entities.
 
-**2. Process Books II-V:**
+**Process Books II–V:**
 
 ```bash
-python process_chapters.py --book 2 --provider openrouter --no-commit
-python process_chapters.py --book 3 --provider openrouter --no-commit
-python process_chapters.py --book 4 --provider openrouter --no-commit
-python process_chapters.py --book 5 --provider openrouter --no-commit
+python process_chapters.py --book 2 --provider openrouter
+python process_chapters.py --book 3 --provider openrouter
+python process_chapters.py --book 4 --provider openrouter
+python process_chapters.py --book 5 --provider openrouter
 ```
 
-Or all at once:
+**Run collection checks after each book:**
 
 ```bash
-python process_chapters.py --all --provider openrouter --no-commit
+markitect infospace check --provider openrouter
+markitect infospace viability
 ```
 
-**3. Run metrics after each book (or at the end):**
-
-```bash
-python process_chapters.py --metrics --provider openrouter --no-commit
-```
-
-**4. Commit the results:**
-
-```bash
-git add examples/infospace-with-history/output/
-git commit -m "infospace: process all remaining chapters"
-```
-
-**5. Review the metrics report:**
-
-Open `output/metrics/metrics-report.md`. It will show:
-
-- Which VSM concepts (S1-S5, recursion, variety, etc.) now have mappings
-- Total entity and mapping counts
-- Consistency scores
-- Recommendations for gaps
-
-### Expected progression
+**Expected progression:**
 
 | After | Chapters | Expected coverage |
 |-------|----------|-------------------|
 | Book I (11 ch.) | 11/35 | S1, S2, S4 strong; S3 emerging |
-| Books I-II (16 ch.) | 16/35 | S3 (capital control) covered |
-| Books I-III (20 ch.) | 20/35 | Historical patterns add depth |
-| Books I-IV (30 ch.) | 30/35 | S5 (policy, mercantilism) emerging |
-| All (35 ch.) | 35/35 | Full coverage, S3* and algedonic signals likely from Book V |
-
-Book V (public revenue, taxation, sovereign duties) is expected to
-fill the remaining gaps in S3*, S5, and regulatory concepts.
+| Books I–II (16 ch.) | 16/35 | S3 (capital control) covered |
+| Books I–III (20 ch.) | 20/35 | Historical patterns add depth |
+| Books I–IV (30 ch.) | 30/35 | S5 (policy, mercantilism) emerging |
+| All (35 ch.) | 35/35 | Full coverage; S3* and algedonic signals from Book V |
 
 ---
 
-## 12. Quality Improvement Loop
+## 13. Using the Infospace as a Discipline
+
+A completed, viable infospace can itself become a **discipline** — a lens
+applied to a new topic. For example, the Wealth of Nations infospace could
+be applied to analyse a modern supply chain.
+
+```bash
+# In a new infospace directory:
+markitect infospace init \
+  --topic "Modern Supply Chain Management" \
+  --domain "Operations Research" \
+  --discipline "Wealth of Nations"
+
+# Bind the WoN infospace as a discipline:
+markitect infospace bind-discipline ../infospace-with-history
+
+# List bound disciplines and their viability:
+markitect infospace disciplines
+# Viable System Model PASS (from vsm-reference/)
+# Wealth of Nations PASS (from ../infospace-with-history)
+
+# Check for stale mappings after discipline update:
+markitect infospace stale-mappings
+```
+
+The discipline infospace must be viable (meeting its own thresholds)
+before it can be used as a lens. If the discipline's entities change,
+dependent mappings are flagged for re-evaluation.
+
+---
+
+## 14. Quality Improvement Loop
 
 The infospace is designed to be **iteratively refined**:
 
 1. **Process chapters** — run the pipeline
-2. **Assess metrics** — identify gaps in VSM coverage and consistency
-3. **Refine guidelines** — update `extraction-rules.md` or
+2. **Evaluate** — `markitect infospace evaluate --provider openrouter`
+3. **Check** — `markitect infospace check --provider openrouter`
+4. **Review viability** — `markitect infospace viability`
+5. **Refine guidelines** — update `extraction-rules.md` or
    `mapping-rules.md` to address identified weaknesses
-4. **Re-process** — delete output files for specific chapters and re-run
-   with improved guidelines
-5. **Compare** — use git diff to see how the refined guidelines changed
-   the output
+6. **Re-process** — delete output files for specific chapters and re-run
+7. **Compare** — `git diff` shows how refined guidelines changed the output
 
-Example: if metrics show that S3* (Audit) is consistently missed, you
-could add a paragraph to `extraction-rules.md` explicitly asking the LLM
-to look for audit, inspection, and oversight mechanisms.
+Example: if checks show S3* (Audit) is consistently missing, add a
+paragraph to `extraction-rules.md` explicitly asking the LLM to look for
+audit, inspection, and oversight mechanisms.
 
-To re-process a specific chapter, remove its chapter view and downstream
-outputs. Note: canonical entity files in `output/entities/` are shared
-across chapters — only delete individual entity files if you want them
-re-extracted from scratch.
+To re-process a specific chapter:
 
 ```bash
-rm -f examples/infospace-with-history/output/entities/book-1-chapter-03-entities.md
-rm -f examples/infospace-with-history/output/mappings/book-1-chapter-03-mappings.md
-rm -f examples/infospace-with-history/output/analyses/book-1-chapter-03-analysis.md
-python process_chapters.py --chapter book-1-chapter-03 --provider openrouter --no-commit
+rm -f output/entities/book-1-chapter-03-entities.md
+rm -f output/mappings/book-1-chapter-03-mappings.md
+rm -f output/analyses/book-1-chapter-03-analysis.md
+python process_chapters.py --chapter book-1-chapter-03 --provider openrouter
 ```
 
-**Important**: never silently delete canonical entity files. If an entity
-is no longer needed, **archive** it with a documented reason:
+Never silently delete canonical entity files.
Archive them instead: ```bash -# Entity found to be redundant — archive it python process_chapters.py --archive-entity extent-of-the-market \ - --reason "Subsumed by market-price and effectual-demand — the concept is fully covered by these two entities" - -# Then re-process the chapter -python process_chapters.py --chapter book-1-chapter-03 --provider openrouter --no-commit + --reason "Subsumed by market-price and effectual-demand" +python process_chapters.py --chapter book-1-chapter-03 --provider openrouter ``` -If you genuinely need to re-extract an entity with different content -(e.g., improving its definition), archive the old version first, then -delete the archive copy only after confirming the new version is better. -The archive in `output/entities/archive/` preserves the full intellectual -history of the infospace — every refinement decision is traceable. +--- + +## 15. The Artifact Database (`infospace.db`) + +The pipeline stores all artifacts and dependency edges in a local SQLite +database — `infospace.db`. This file is **not committed to git** because +it is fully derived from the markdown files that are tracked. + +To regenerate it after a fresh clone (no LLM calls needed): + +```bash +python process_chapters.py --all --no-commit +``` --- -## 13. Infrastructure Issues Found and Fixed +## 16. Adapting This Pattern to Your Own Project -During development we documented three issues with the MarkiTect -infrastructure in `INFRA-TASKS.md`: +To build your own infospace: -1. **Artifact repo doesn't store content** — the resolver returned - placeholder text instead of actual artifact content. -2. **ContentMacro `raw_text` defaults to `""`** — caused silent data - corruption when macros were constructed programmatically. -3. **No `@{target}` syntax in MacroParser** — macros had to be - constructed manually rather than auto-detected from template text. +1. `markitect infospace init --topic "..." --domain "..." --discipline "..."` +2. 
Write schemas defining required sections for each output type +3. Write extraction guidelines that tell the LLM what to look for +4. Create prompt templates using `@{macro}` syntax +5. Populate `artifacts/sources/` with your source corpus +6. Run `process_chapters.py` (or your equivalent pipeline script) +7. Evaluate with `markitect infospace evaluate` and `check` +8. Review `markitect infospace viability` against your thresholds +9. Iterate: refine guidelines, re-process, re-evaluate +10. Once viable, use as a discipline for a new infospace -All three have been fixed in the markitect infrastructure. The pipeline -script (`process_chapters.py`) has been refactored to use the fixed -infrastructure directly — the local content cache, manual macro -construction, and manual substitution workarounds have been removed. -See `INFRA-TASKS.md` for details on each fix. - ---- - -## 14. Adapting This Pattern to Your Own Project - -To build your own infospace using this pattern: - -1. **Choose your source corpus** — any collection of documents you want - to transform into structured knowledge. -2. **Define your target ontology** — what concepts, relationships, or - categories you want to extract (our VSM is just one example). -3. **Write schemas** — markdown documents defining the required sections - and validation rules for each output type. -4. **Write extraction guidelines** — rules that tell the LLM what to - look for and how to handle edge cases. -5. **Create prompt templates** — use `@{macro}` syntax to inject source - text and guidelines at compile time. -6. **Build your pipeline** — follow `process_chapters.py` as a reference - for loading artifacts, resolving templates, and calling the LLM. -7. **Process incrementally** — work through your corpus one document at a - time, tracking everything in git. -8. **Measure and refine** — define metrics, assess them periodically, - and update your guidelines when gaps appear. 
- -The key architectural insight is that **schemas and guidelines are -artifacts** — they live in the same repository as your source text and -can be versioned, diffed, and refined just like code. +The key insight is that **schemas and guidelines are artifacts** — they +live in the repository and can be versioned and diffed just like code. +Every refinement decision is traceable through git history.
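
Note on the new section 16, step 4 ("Create prompt templates using `@{macro}` syntax"): the tutorial names the syntax but this hunk never shows what resolving a placeholder involves. A minimal sketch of the idea follows, under stated assumptions. It is not MarkiTect's resolver: `MACRO_PATTERN`, `resolve_macros`, the in-memory `artifacts` dict, and the `chapter_text` key are hypothetical names chosen for illustration. Only `@{existing_entities}` appears in the diff itself.

```python
import re

# Hypothetical sketch, NOT MarkiTect's implementation: it only illustrates
# the `@{name}` placeholder idea used by the prompt templates.
MACRO_PATTERN = re.compile(r"@\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_macros(template: str, artifacts: dict[str, str]) -> str:
    """Replace every @{name} in `template` with artifacts[name].

    Raises KeyError if any placeholder has no matching artifact, so a
    missing artifact cannot silently produce an incomplete prompt.
    """
    wanted = {m.group(1) for m in MACRO_PATTERN.finditer(template)}
    missing = sorted(wanted - set(artifacts))
    if missing:
        raise KeyError("unresolved macros: " + ", ".join(missing))
    # A function replacement inserts the artifact text verbatim, without
    # interpreting backslashes in it as regex escape sequences.
    return MACRO_PATTERN.sub(lambda m: artifacts[m.group(1)], template)


# Assumed template and artifact contents, for illustration only.
template = (
    "Known entities:\n@{existing_entities}\n\n"
    "Chapter text:\n@{chapter_text}"
)
prompt = resolve_macros(
    template,
    {
        "existing_entities": "- division-of-labour\n- natural-price",
        "chapter_text": "(chapter text goes here)",
    },
)
print(prompt)
```

Failing loudly on an unresolved placeholder is a deliberate choice in this sketch: an empty substitution would still yield a syntactically valid prompt, which makes the gap hard to spot in the LLM output.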