feat(infospace): flat canonical entity set with cross-chapter deduplication

Restructure entity storage from per-chapter subdirectories to a flat canonical set in output/entities/. Each entity exists as a single file; duplicates across chapters are detected by slug collision and skipped (first occurrence wins). Chapter views use {{ include }} transclusion to reference shared entity files.

Add an @{existing_entities} macro to the extract-entities template so the LLM knows which entities already exist and focuses on genuinely new ones.

Extract _call_llm() from _execute_llm() for callers that handle their own file I/O.

Result: 41 unique entities from 4 chapters (2 duplicates removed).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11 22:24:20 +01:00
parent 706981c39f
commit 2d1282a61e
52 changed files with 1738 additions and 1376 deletions
--- a/examples/infospace-with-history/TUTORIAL.md
+++ b/examples/infospace-with-history/TUTORIAL.md
@@ -62,12 +62,38 @@ examples/infospace-with-history/
 │   └── vsm-reference/         # VSM framework definition
 │
 └── output/                     # Generated artifacts (LLM outputs)
-    ├── entities/               # Per-chapter entity extractions
+    ├── entities/               # Flat canonical entity set + chapter views
+    │   ├── division-of-labour.md        # Canonical entity file (PRIMARY)
+    │   ├── exchange.md
+    │   ├── commercial-society.md
+    │   ├── ...
+    │   ├── book-1-chapter-01-entities.md  # Chapter view (transclusion)
+    │   ├── book-1-chapter-01-prompt.md   # Compiled prompt
+    │   ├── book-1-chapter-04-entities.md  # Also references division-of-labour.md
+    │   └── ...
    ├── mappings/               # Per-chapter VSM mappings
    ├── analyses/               # Per-chapter synthesised analyses
    └── metrics/                # Cross-chapter metrics reports
 ```

+**Entity organisation**: The infospace maintains a **flat canonical set**
+of entities — one markdown file per entity, stored directly in
+`output/entities/`. When a chapter mentions an entity that already exists
+(detected by slug collision), the duplicate is skipped and the original
+definition is kept. Over time this builds a **minimal set of entities**
+that is both necessary and sufficient to cover the entire book.
+
+Per-chapter `*-entities.md` files are **secondary views** that use
+MarkiTect's transclusion engine (`{{ include "entity.md" }}`) to compose
+entity content by reference. The same entity (e.g., `division-of-labour.md`)
+can appear in multiple chapter views. Editing a canonical entity file
+automatically updates every chapter view that references it.
+
+**Deduplication**: The pipeline tells the LLM which entities already exist
+(via the `@{existing_entities}` macro in the extraction template) so it
+focuses on genuinely new entities. At the file level, slug collisions
+are detected and skipped as a safety net.
+
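The file-level safety net described above can be sketched as follows. This is a minimal illustration, not the code in `process_chapters.py`: the `slugify` rule and the `write_if_new` helper are hypothetical names, and the real slug rule may differ.

```python
import re
import tempfile
from pathlib import Path


def slugify(name: str) -> str:
    """Hypothetical slug rule: lowercase, runs of non-alphanumerics become '-'."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def write_if_new(entities_dir: Path, name: str, body: str) -> bool:
    """Write a canonical entity file unless its slug already exists.

    Returns True when a new file was written, False on a slug collision.
    """
    path = entities_dir / f"{slugify(name)}.md"
    if path.exists():
        return False  # first occurrence wins; the later duplicate is skipped
    path.write_text(body, encoding="utf-8")
    return True


with tempfile.TemporaryDirectory() as tmp:
    entities_dir = Path(tmp)
    # The same entity reported by two chapters, with different capitalisation.
    first = write_if_new(entities_dir, "Division of Labour", "from chapter 1\n")
    second = write_if_new(entities_dir, "division of labour", "from chapter 4\n")
    kept = (entities_dir / "division-of-labour.md").read_text(encoding="utf-8")
```

Because the check is a plain existence test on the slug, two differently worded names for the same concept would not collide; that case relies on the LLM-level hint from `@{existing_entities}`.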
 ---

 ## 3. Designing Schemas
@@ -149,6 +175,10 @@ Adam Smith's *The Wealth of Nations*.

@{vsm_framework}

+## Existing Entities
+
+@{existing_entities}
+
 ## Instructions
 [... detailed step-by-step instructions ...]

@@ -158,8 +188,11 @@ Output each entity as a separate markdown document, delimited by
 `--- ENTITY: <entity-name> ---` markers.
 ```

-The three macros (`chapter_text`, `extraction_rules`, `vsm_framework`) are
-resolved by looking up artifacts by name in the relevant information spaces.
+The four macros (`chapter_text`, `extraction_rules`, `vsm_framework`,
+`existing_entities`) are resolved by looking up artifacts by name in
+the relevant information spaces. The `existing_entities` list is
+dynamically generated at runtime from the canonical entity files
+already on disk, enabling incremental extraction without duplication.
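
One way the `existing_entities` value could be derived from the files on disk is sketched below. The helper names and the bullet-list rendering are assumptions; the sketch also assumes that chapter views and compiled prompts are the only non-canonical files in `output/entities/` and can be recognised by their suffix.

```python
import tempfile
from pathlib import Path

# Assumed suffixes of non-canonical files that share output/entities/.
VIEW_SUFFIXES = ("-entities.md", "-prompt.md")


def existing_entity_slugs(entities_dir: Path) -> list[str]:
    """Return sorted slugs of canonical entity files, skipping views and prompts."""
    return sorted(
        p.stem
        for p in entities_dir.glob("*.md")
        if not p.name.endswith(VIEW_SUFFIXES)
    )


def render_existing_entities(slugs: list[str]) -> str:
    """Format the slugs as a markdown bullet list for the macro value."""
    if not slugs:
        return "(none yet)"
    return "\n".join(f"- {slug}" for slug in slugs)


with tempfile.TemporaryDirectory() as tmp:
    d = Path(tmp)
    for name in (
        "exchange.md",
        "division-of-labour.md",
        "book-1-chapter-01-entities.md",
        "book-1-chapter-01-prompt.md",
    ):
        (d / name).write_text("", encoding="utf-8")
    macro_value = render_existing_entities(existing_entity_slugs(d))
```

Sorting keeps the compiled prompt stable between runs, which makes the stored `*-prompt.md` files easier to diff.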

 ### Template 2: Map to VSM (`templates/map-to-vsm.md`)

@@ -275,19 +308,23 @@ python process_chapters.py --all --provider openrouter --no-commit
 python process_chapters.py --list
 ```

-Prints a table showing which chapters have completed each stage:
+Prints a table showing which chapters have completed each stage
+(entity counts reflect the chapter view's transclusion references,
+including shared entities from earlier chapters):

 ```
 Available chapters (35):

  Chapter                        Entities     Mappings     Analysis
  ------------------------------ ------------ ------------ ------------
-  book-1-chapter-01              done         done         done
-  book-1-chapter-02              done         done         done
-  book-1-chapter-03              done         done         done
-  book-1-chapter-04              done         done         done
+  book-1-chapter-01              done (13)    done         done
+  book-1-chapter-02              done (7)     done         done
+  book-1-chapter-03              done (18)    done         done
+  book-1-chapter-04              done (5)     done         done
  book-1-chapter-05              -            -            -
  ...
+
+  Canonical entity set: 41 unique entities
 ```
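
The per-chapter counts in the table sum to 13 + 7 + 18 + 5 = 43 references, while the canonical set holds 41 files; this is consistent with two references pointing at entities first defined in an earlier chapter. A count of this kind could be obtained by scanning each chapter view for include directives, as in this sketch. The regular expression and the sample views (including `money.md`) are illustrative assumptions, not the script's actual logic or data.

```python
import re

# Matches directives of the form {{ include "some-entity.md" }}.
INCLUDE = re.compile(r'\{\{\s*include\s+"([^"]+)"\s*\}\}')


def included_files(view_text: str) -> list[str]:
    """Return the file names referenced by include directives, in order."""
    return INCLUDE.findall(view_text)


# Two made-up chapter views that share one entity.
view_ch1 = '{{ include "division-of-labour.md" }}\n{{ include "exchange.md" }}\n'
view_ch4 = '{{ include "division-of-labour.md" }}\n{{ include "money.md" }}\n'

per_chapter = [len(included_files(v)) for v in (view_ch1, view_ch4)]
unique = {name for v in (view_ch1, view_ch4) for name in included_files(v)}
```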

 ### Assessing metrics
@@ -335,10 +372,20 @@ Place your key in one of these locations (checked in order):
   with real content)
 3. It writes the compiled prompt to `output/<stage>/<chapter>-prompt.md`
   for inspection
-4. If an LLM adapter is configured and no output file exists yet, it
-   **executes** the prompt and writes the result
-5. The output is **stored** as a generated artifact in the repository
-6. Dependency edges are **recorded** in the graph
+4. If no output exists yet and an LLM adapter is configured, it
+   **executes** the prompt
+5. **For entity extraction (stage 1):** the pipeline first binds the
+   list of already-existing entity slugs to `@{existing_entities}` so
+   the LLM knows what to skip. The LLM returns combined content with
+   `--- ENTITY: <name> ---` delimiters. The pipeline splits this into
+   the **flat canonical directory** (`output/entities/<slug>.md`),
+   skipping any entity whose slug already exists. It then generates the
+   chapter view file with transclusion directives. The combined content
+   is never persisted as a single file — canonical entity files are the
+   source of truth.
+6. **For other stages:** the result is written directly to its output file
+7. The output is **stored** as a generated artifact in the repository
+8. Dependency edges are **recorded** in the graph
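
The splitting and view generation in step 5 can be sketched as below. Only the `--- ENTITY: <name> ---` delimiter and the `{{ include "..." }}` directive come from the tutorial; the function names and the heading line of the view are hypothetical.

```python
import re

# Matches delimiter lines such as "--- ENTITY: Division of Labour ---".
ENTITY_MARKER = re.compile(r"^--- ENTITY: (.+?) ---[ \t]*$", re.MULTILINE)


def split_entities(combined: str) -> list[tuple[str, str]]:
    """Split combined LLM output into (name, body) pairs, in document order."""
    parts = ENTITY_MARKER.split(combined)
    # With one capture group, split() yields [preamble, name1, body1, name2, ...].
    names, bodies = parts[1::2], parts[2::2]
    return [(n.strip(), b.strip() + "\n") for n, b in zip(names, bodies)]


def render_chapter_view(chapter: str, slugs: list[str]) -> str:
    """Build a chapter view that references each entity by transclusion."""
    lines = [f"# Entities: {chapter}", ""]  # heading format is an assumption
    lines += [f'{{{{ include "{slug}.md" }}}}' for slug in slugs]
    return "\n".join(lines) + "\n"


combined = (
    "preamble\n"
    "--- ENTITY: Exchange ---\n# Exchange\nText.\n"
    "--- ENTITY: Commercial Society ---\n# Commercial Society\n"
)
pairs = split_entities(combined)
view = render_chapter_view("book-1-chapter-01", ["exchange", "commercial-society"])
```

The view lists slugs rather than copying bodies, so an entity skipped as a duplicate can still be referenced by the chapter that mentioned it.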

 ---

@@ -347,7 +394,11 @@ Place your key in one of these locations (checked in order):
 Every processed chapter produces a git commit containing:

 - Compiled prompts (`*-prompt.md`) — so you can audit exactly what was sent
-- Generated outputs (`*-entities.md`, `*-mappings.md`, `*-analysis.md`)
+- Canonical entity files (`output/entities/<slug>.md`) — one file per entity,
+  shared across chapters, first occurrence wins
+- Chapter entity views (`<chapter>-entities.md`), which transclude the
+  canonical entities relevant to each chapter
+- Generated outputs (`*-mappings.md`, `*-analysis.md`)

 This means:

@@ -366,7 +417,8 @@ To commit manually after reviewing:

 ```bash
 python process_chapters.py --chapter book-1-chapter-05 --provider openrouter --no-commit
-# review output/entities/book-1-chapter-05-entities.md etc.
+# review new entity files in output/entities/ (look for recently modified .md files)
+# review chapter view in output/entities/book-1-chapter-05-entities.md
 git add examples/infospace-with-history/output/
 git commit -m "infospace: process book-1-chapter-05"
 ```
@@ -407,7 +459,9 @@ how to complete the rest.
 python process_chapters.py --book 1 --provider openrouter --no-commit
 ```

-Already-processed chapters are skipped (their output files exist).
+Already-processed chapters are skipped (their chapter view files exist).
+Entities from earlier chapters are automatically shared — the LLM is
+told which entities already exist and avoids re-extracting them.

 **2. Process Books II-V:**

@@ -478,32 +532,44 @@ Example: if metrics show that S3* (Audit) is consistently missed, you
 could add a paragraph to `extraction-rules.md` explicitly asking the LLM
 to look for audit, inspection, and oversight mechanisms.

-To re-process a specific chapter:
+To re-process a specific chapter, remove its chapter view and downstream
+outputs. Note: canonical entity files in `output/entities/` are shared
+across chapters — only delete individual entity files if you want them
+re-extracted from scratch.

 ```bash
-rm examples/infospace-with-history/output/entities/book-1-chapter-03-entities.md
-rm examples/infospace-with-history/output/mappings/book-1-chapter-03-mappings.md
-rm examples/infospace-with-history/output/analyses/book-1-chapter-03-analysis.md
+rm -f examples/infospace-with-history/output/entities/book-1-chapter-03-entities.md
+rm -f examples/infospace-with-history/output/mappings/book-1-chapter-03-mappings.md
+rm -f examples/infospace-with-history/output/analyses/book-1-chapter-03-analysis.md
 python process_chapters.py --chapter book-1-chapter-03 --provider openrouter --no-commit
 ```

+To also re-extract specific entities, delete their canonical files first:
+
+```bash
+rm -f examples/infospace-with-history/output/entities/extent-of-the-market.md
+# then re-process the chapter as above
+```
+
 ---

-## 12. Infrastructure Issues Found
+## 12. Infrastructure Issues Found and Fixed

 During development we documented three issues with the MarkiTect
 infrastructure in `INFRA-TASKS.md`:

-1. **Artifact repo doesn't store content** — the resolver returns
-   placeholder text; the pipeline works around this with a local cache.
-2. **ContentMacro `raw_text` defaults to `""`** — causes silent data
-   corruption when macros are constructed programmatically.
-3. **No `@{target}` syntax in TemplateAnalyzer** — macros must be
+1. **Artifact repo doesn't store content** — the resolver returned
+   placeholder text instead of actual artifact content.
+2. **ContentMacro `raw_text` defaults to `""`** — caused silent data
+   corruption when macros were constructed programmatically.
+3. **No `@{target}` syntax in MacroParser** — macros had to be
   constructed manually rather than auto-detected from template text.

-These are intentionally not fixed in this example (the constraint was
-"no changes to markitect infrastructure"). They are tracked for future
-improvement, after which the experiment can be re-run.
+All three have been fixed in the MarkiTect infrastructure. The pipeline
+script (`process_chapters.py`) has been refactored to use the fixed
+infrastructure directly — the local content cache, manual macro
+construction, and manual substitution workarounds have been removed.
+See `INFRA-TASKS.md` for details on each fix.

 ---