feat(infospace): flat canonical entity set with cross-chapter deduplication

Restructure entity storage from per-chapter subdirectories to a flat
canonical set in output/entities/. Each entity exists as a single file;
duplicates across chapters are detected by slug collision and skipped
(first occurrence wins). Chapter views use {{ include }} transclusion
to reference shared entity files.

Add @{existing_entities} macro to extract-entities template so the LLM
knows which entities already exist and focuses on genuinely new ones.
Extract _call_llm() from _execute_llm() for callers that handle their
own file I/O. Result: 41 unique entities from 4 chapters (2 duplicates
removed).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11 22:24:20 +01:00
parent 706981c39f
commit 2d1282a61e
52 changed files with 1738 additions and 1376 deletions


@@ -62,12 +62,38 @@ examples/infospace-with-history/
│ └── vsm-reference/ # VSM framework definition
└── output/ # Generated artifacts (LLM outputs)
├── entities/ # Per-chapter entity extractions
├── entities/ # Flat canonical entity set + chapter views
│ ├── division-of-labour.md # Canonical entity file (PRIMARY)
│ ├── exchange.md
│ ├── commercial-society.md
│ ├── ...
│ ├── book-1-chapter-01-entities.md # Chapter view (transclusion)
│ ├── book-1-chapter-01-prompt.md # Compiled prompt
│ ├── book-1-chapter-04-entities.md # Also references division-of-labour.md
│ └── ...
├── mappings/ # Per-chapter VSM mappings
├── analyses/ # Per-chapter synthesised analyses
└── metrics/ # Cross-chapter metrics reports
```
**Entity organisation**: The infospace maintains a **flat canonical set**
of entities — one markdown file per entity, stored directly in
`output/entities/`. When a chapter mentions an entity that already exists
(detected by slug collision), the duplicate is skipped and the original
definition is kept. This builds a **minimal necessary and sufficient set**
of entities across the entire book.
Per-chapter `*-entities.md` files are **secondary views** that use
MarkiTect's transclusion engine (`{{ include "entity.md" }}`) to compose
entity content by reference. The same entity (e.g., `division-of-labour.md`)
can appear in multiple chapter views. Editing a canonical entity file
automatically updates every chapter view that references it.
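A chapter view of this kind could be generated roughly as follows. This is a minimal sketch, not the pipeline's actual code: the function name `build_chapter_view` and the heading line are hypothetical, and only the `{{ include "..." }}` directive form is taken from the text above.

```python
def build_chapter_view(chapter: str, slugs: list[str]) -> str:
    """Compose a chapter view that references canonical entity files.

    The heading format is an assumption made for illustration.
    """
    lines = [f"# Entities: {chapter}", ""]
    for slug in slugs:
        # One include directive per entity; the content itself stays in
        # <slug>.md, so editing that file changes every view that uses it.
        lines.append(f'{{{{ include "{slug}.md" }}}}')
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_chapter_view("book-1-chapter-04", ["division-of-labour", "money"]))
```

Because the view holds only references, two chapter views can list the same slug without copying any entity text.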
**Deduplication**: The pipeline tells the LLM which entities already exist
(via the `@{existing_entities}` macro in the extraction template) so it
focuses on genuinely new entities. At the file level, slug collisions
are detected and skipped as a safety net.
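The file-level safety net can be pictured as a write that refuses to overwrite. The sketch below assumes a simple slug rule (lower-case, non-alphanumeric runs collapsed to hyphens); the real slug function may differ, and `slugify` and `write_if_new` are hypothetical names.

```python
import re
from pathlib import Path


def slugify(name: str) -> str:
    """Lower-case the name and collapse non-alphanumeric runs into hyphens.

    This rule is an assumption; the pipeline's own slug rule may differ.
    """
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def write_if_new(entities_dir: Path, name: str, body: str) -> bool:
    """Write <slug>.md unless it already exists; return True if written."""
    path = entities_dir / f"{slugify(name)}.md"
    if path.exists():
        # Slug collision: keep the original definition (first occurrence wins).
        return False
    path.write_text(body, encoding="utf-8")
    return True
```

Returning a boolean lets the caller count how many duplicates were skipped, which is one way a summary such as "2 duplicates removed" could be produced.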
---
## 3. Designing Schemas
@@ -149,6 +175,10 @@ Adam Smith's *The Wealth of Nations*.
@{vsm_framework}
## Existing Entities
@{existing_entities}
## Instructions
[... detailed step-by-step instructions ...]
@@ -158,8 +188,11 @@ Output each entity as a separate markdown document, delimited by
`--- ENTITY: <entity-name> ---` markers.
```
The three macros (`chapter_text`, `extraction_rules`, `vsm_framework`) are
resolved by looking up artifacts by name in the relevant information spaces.
The four macros (`chapter_text`, `extraction_rules`, `vsm_framework`,
`existing_entities`) are resolved by looking up artifacts by name in
the relevant information spaces. The `existing_entities` list is
dynamically generated at runtime from the canonical entity files
already on disk, enabling incremental extraction without duplication.
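One way to build that list is to scan the entities directory and ignore the non-canonical files that share it. The following is a sketch under stated assumptions: the helper names are hypothetical, and the bullet-list format is illustrative rather than the format the pipeline actually binds to `@{existing_entities}`.

```python
from pathlib import Path

# Files in output/entities/ that are not canonical entities, going by the
# naming shown in the directory listing (chapter views and compiled prompts).
NON_CANONICAL_SUFFIXES = ("-entities.md", "-prompt.md")


def list_existing_slugs(entities_dir: Path) -> list[str]:
    """Return the slugs of canonical entity files, sorted by name."""
    slugs = []
    for path in sorted(entities_dir.glob("*.md")):
        if path.name.endswith(NON_CANONICAL_SUFFIXES):
            continue  # skip chapter views and compiled prompts
        slugs.append(path.stem)
    return slugs


def format_existing_entities(slugs: list[str]) -> str:
    """Render the slugs as a markdown bullet list (assumed format)."""
    if not slugs:
        return "(none yet)"
    return "\n".join(f"- {slug}" for slug in slugs)
```

On the first chapter the directory is empty, so the macro would resolve to the placeholder text and the LLM extracts everything it finds.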
### Template 2: Map to VSM (`templates/map-to-vsm.md`)
@@ -275,19 +308,23 @@ python process_chapters.py --all --provider openrouter --no-commit
python process_chapters.py --list
```
Prints a table showing which chapters have completed each stage:
Prints a table showing which chapters have completed each stage
(entity counts reflect the chapter view's transclusion references,
including shared entities from earlier chapters):
```
Available chapters (35):
Chapter Entities Mappings Analysis
------------------------------ ------------ ------------ ------------
book-1-chapter-01 done done done
book-1-chapter-02 done done done
book-1-chapter-03 done done done
book-1-chapter-04 done done done
book-1-chapter-01 done (13) done done
book-1-chapter-02 done (7) done done
book-1-chapter-03 done (18) done done
book-1-chapter-04 done (5) done done
book-1-chapter-05 - - -
...
Canonical entity set: 41 unique entities
```
### Assessing metrics
@@ -335,10 +372,20 @@ Place your key in one of these locations (checked in order):
with real content)
3. It writes the compiled prompt to `output/<stage>/<chapter>-prompt.md`
for inspection
4. If an LLM adapter is configured and no output file exists yet, it
**executes** the prompt and writes the result
5. The output is **stored** as a generated artifact in the repository
6. Dependency edges are **recorded** in the graph
4. If no output exists yet and an LLM adapter is configured, it
**executes** the prompt
5. **For entity extraction (stage 1):** the pipeline first binds the
list of already-existing entity slugs to `@{existing_entities}` so
the LLM knows what to skip. The LLM returns combined content with
`--- ENTITY: <name> ---` delimiters. The pipeline splits this content
into one file per entity in the **flat canonical directory**
(`output/entities/<slug>.md`), skipping any entity whose slug already
exists. It then generates the chapter view file with transclusion
directives. The combined content is never persisted as a single file;
the canonical entity files are the source of truth.
6. **For other stages:** the result is written directly to its output file
7. The output is **stored** as a generated artifact in the repository
8. Dependency edges are **recorded** in the graph
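The splitting part of step 5 can be sketched with a regular expression over the `--- ENTITY: <name> ---` delimiter. This is an illustration only: `split_entities` is a hypothetical name, and the actual pipeline may parse the output differently.

```python
import re

# Matches a delimiter line such as "--- ENTITY: Division of Labour ---".
ENTITY_MARKER = re.compile(r"^--- ENTITY: (.+?) ---[ \t]*$", re.MULTILINE)


def split_entities(combined: str) -> list[tuple[str, str]]:
    """Return (name, body) pairs in the order they appear in the LLM output."""
    parts = ENTITY_MARKER.split(combined)
    # parts[0] is any preamble before the first marker; after that the list
    # alternates name, body, name, body, ...
    names = parts[1::2]
    bodies = parts[2::2]
    return [(name.strip(), body.strip()) for name, body in zip(names, bodies)]
```

Each returned pair would then go through the slug check, so an entity whose slug already exists on disk is dropped while its name can still be listed in the chapter view.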
---
@@ -347,7 +394,11 @@ Place your key in one of these locations (checked in order):
Every processed chapter produces a git commit containing:
- Compiled prompts (`*-prompt.md`) — so you can audit exactly what was sent
- Generated outputs (`*-entities.md`, `*-mappings.md`, `*-analysis.md`)
- Canonical entity files (`output/entities/<slug>.md`) — one file per entity,
shared across chapters, first occurrence wins
- Chapter entity views (`<chapter>-entities.md`): transclusion references
to the canonical entities relevant to each chapter
- Generated outputs (`*-mappings.md`, `*-analysis.md`)
This means:
@@ -366,7 +417,8 @@ To commit manually after reviewing:
```bash
python process_chapters.py --chapter book-1-chapter-05 --provider openrouter --no-commit
# review output/entities/book-1-chapter-05-entities.md etc.
# review new entity files in output/entities/ (look for recently modified .md files)
# review chapter view in output/entities/book-1-chapter-05-entities.md
git add examples/infospace-with-history/output/
git commit -m "infospace: process book-1-chapter-05"
```
@@ -407,7 +459,9 @@ how to complete the rest.
python process_chapters.py --book 1 --provider openrouter --no-commit
```
Already-processed chapters are skipped (their output files exist).
Already-processed chapters are skipped (their chapter view files exist).
Entities from earlier chapters are automatically shared — the LLM is
told which entities already exist and avoids re-extracting them.
**2. Process Books II-V:**
@@ -478,32 +532,44 @@ Example: if metrics show that S3* (Audit) is consistently missed, you
could add a paragraph to `extraction-rules.md` explicitly asking the LLM
to look for audit, inspection, and oversight mechanisms.
To re-process a specific chapter:
To re-process a specific chapter, remove its chapter view and downstream
outputs. Note: canonical entity files in `output/entities/` are shared
across chapters — only delete individual entity files if you want them
re-extracted from scratch.
```bash
rm examples/infospace-with-history/output/entities/book-1-chapter-03-entities.md
rm examples/infospace-with-history/output/mappings/book-1-chapter-03-mappings.md
rm examples/infospace-with-history/output/analyses/book-1-chapter-03-analysis.md
rm -f examples/infospace-with-history/output/entities/book-1-chapter-03-entities.md
rm -f examples/infospace-with-history/output/mappings/book-1-chapter-03-mappings.md
rm -f examples/infospace-with-history/output/analyses/book-1-chapter-03-analysis.md
python process_chapters.py --chapter book-1-chapter-03 --provider openrouter --no-commit
```
To also re-extract specific entities, delete their canonical files first:
```bash
rm -f examples/infospace-with-history/output/entities/extent-of-the-market.md
# then re-process the chapter as above
```
---
## 12. Infrastructure Issues Found
## 12. Infrastructure Issues Found and Fixed
During development we documented three issues with the MarkiTect
infrastructure in `INFRA-TASKS.md`:
1. **Artifact repo doesn't store content** — the resolver returns
placeholder text; the pipeline works around this with a local cache.
2. **ContentMacro `raw_text` defaults to `""`** — causes silent data
corruption when macros are constructed programmatically.
3. **No `@{target}` syntax in TemplateAnalyzer** — macros must be
1. **Artifact repo doesn't store content** — the resolver returned
placeholder text instead of actual artifact content.
2. **ContentMacro `raw_text` defaults to `""`** — caused silent data
corruption when macros were constructed programmatically.
3. **No `@{target}` syntax in MacroParser** — macros had to be
constructed manually rather than auto-detected from template text.
These are intentionally not fixed in this example (the constraint was
"no changes to markitect infrastructure"). They are tracked for future
improvement, after which the experiment can be re-run.
All three have been fixed in the MarkiTect infrastructure. The pipeline
script (`process_chapters.py`) has been refactored to use the fixed
infrastructure directly: the local content cache, manual macro
construction, and manual substitution workarounds have been removed.
See `INFRA-TASKS.md` for details on each fix.
---