feat(infospace): flat canonical entity set with cross-chapter deduplication

Restructure entity storage from per-chapter subdirectories to a flat
canonical set in output/entities/. Each entity exists as a single file;
duplicates across chapters are detected by slug collision and skipped
(first occurrence wins). Chapter views use {{ include }} transclusion
to reference shared entity files.

Add @{existing_entities} macro to extract-entities template so the LLM
knows which entities already exist and focuses on genuinely new ones.

Extract _call_llm() from _execute_llm() for callers that handle their
own file I/O.

Result: 41 unique entities from 4 chapters (2 duplicates removed).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
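
Note on the message above: the "first occurrence wins" rule and the transclusion-based chapter views can be illustrated with a minimal sketch. This is not the code from `process_chapters.py`; the helper names (`slugify`, `store_entities`, `render_chapter_view`) and the exact slug rule are assumptions made for illustration. Only the `{{ include "entity.md" }}` directive form and the `<slug>.md` file layout come from the commit.

```python
import re
from pathlib import Path


def slugify(name: str) -> str:
    """Hypothetical slug rule: lowercase, non-alphanumeric runs become '-'."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def store_entities(entities: dict[str, str], out_dir: Path) -> tuple[list[str], list[str]]:
    """Write each entity to <slug>.md unless a file with that slug exists.

    Returns (written, skipped) lists of slugs. An existing file is never
    overwritten, so the first occurrence of an entity wins.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written: list[str] = []
    skipped: list[str] = []
    for name, body in entities.items():
        slug = slugify(name)
        path = out_dir / f"{slug}.md"
        if path.exists():
            skipped.append(slug)  # slug collision: keep the original definition
            continue
        path.write_text(body, encoding="utf-8")
        written.append(slug)
    return written, skipped


def render_chapter_view(slugs: list[str]) -> str:
    """Compose a chapter view as one include directive per entity."""
    return "\n\n".join(f'{{{{ include "{slug}.md" }}}}' for slug in slugs) + "\n"
```

A chapter view would list every entity relevant to the chapter, both newly written and skipped slugs, which is why a shared entity can appear in several views while existing only once on disk.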
@@ -62,12 +62,38 @@ examples/infospace-with-history/
 │   └── vsm-reference/  # VSM framework definition
 │
 └── output/  # Generated artifacts (LLM outputs)
-    ├── entities/  # Per-chapter entity extractions
+    ├── entities/  # Flat canonical entity set + chapter views
+    │   ├── division-of-labour.md  # Canonical entity file (PRIMARY)
+    │   ├── exchange.md
+    │   ├── commercial-society.md
+    │   ├── ...
+    │   ├── book-1-chapter-01-entities.md  # Chapter view (transclusion)
+    │   ├── book-1-chapter-01-prompt.md  # Compiled prompt
+    │   ├── book-1-chapter-04-entities.md  # Also references division-of-labour.md
+    │   └── ...
     ├── mappings/  # Per-chapter VSM mappings
     ├── analyses/  # Per-chapter synthesised analyses
     └── metrics/  # Cross-chapter metrics reports
 ```

+**Entity organisation**: The infospace maintains a **flat canonical set**
+of entities — one markdown file per entity, stored directly in
+`output/entities/`. When a chapter mentions an entity that already exists
+(detected by slug collision), the duplicate is skipped and the original
+definition is kept. This builds a **minimal necessary and sufficient set**
+of entities across the entire book.
+
+Per-chapter `*-entities.md` files are **secondary views** that use
+MarkiTect's transclusion engine (`{{ include "entity.md" }}`) to compose
+entity content by reference. The same entity (e.g., `division-of-labour.md`)
+can appear in multiple chapter views. Editing a canonical entity file
+automatically updates every chapter view that references it.
+
+**Deduplication**: The pipeline tells the LLM which entities already exist
+(via the `@{existing_entities}` macro in the extraction template) so it
+focuses on genuinely new entities. At the file level, slug collisions
+are detected and skipped as a safety net.
+
 ---

 ## 3. Designing Schemas
@@ -149,6 +175,10 @@ Adam Smith's *The Wealth of Nations*.

 @{vsm_framework}

+## Existing Entities
+
+@{existing_entities}
+
 ## Instructions
 [... detailed step-by-step instructions ...]

@@ -158,8 +188,11 @@ Output each entity as a separate markdown document, delimited by
 `--- ENTITY: <entity-name> ---` markers.
 ```

-The three macros (`chapter_text`, `extraction_rules`, `vsm_framework`) are
-resolved by looking up artifacts by name in the relevant information spaces.
+The four macros (`chapter_text`, `extraction_rules`, `vsm_framework`,
+`existing_entities`) are resolved by looking up artifacts by name in
+the relevant information spaces. The `existing_entities` list is
+dynamically generated at runtime from the canonical entity files
+already on disk, enabling incremental extraction without duplication.

 ### Template 2: Map to VSM (`templates/map-to-vsm.md`)

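
Note on the hunk above: one way the `existing_entities` list could be built from the files on disk is sketched below. The function names are hypothetical and the real pipeline may differ. The sketch also assumes that chapter views (`*-entities.md`) and compiled prompts (`*-prompt.md`) live in the same directory as the canonical files, as the directory listing in this commit shows, so they have to be filtered out.

```python
from pathlib import Path

# Assumption based on the directory listing in this commit: chapter views end
# in "-entities.md" and compiled prompts end in "-prompt.md".
NON_CANONICAL_SUFFIXES = ("-entities.md", "-prompt.md")


def existing_entity_slugs(entities_dir: Path) -> list[str]:
    """Return the sorted slugs of canonical entity files already on disk."""
    if not entities_dir.is_dir():
        return []
    return sorted(
        path.stem
        for path in entities_dir.glob("*.md")
        if not path.name.endswith(NON_CANONICAL_SUFFIXES)
    )


def render_existing_entities(slugs: list[str]) -> str:
    """Format the slugs as a markdown bullet list for the prompt."""
    if not slugs:
        return "(none yet)"
    return "\n".join(f"- {slug}" for slug in slugs)
```

A suffix filter like this would misclassify an entity whose own slug happens to end in `-entities` or `-prompt`; a stricter chapter-name pattern would avoid that if it ever matters.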
@@ -275,19 +308,23 @@ python process_chapters.py --all --provider openrouter --no-commit
 python process_chapters.py --list
 ```

-Prints a table showing which chapters have completed each stage:
+Prints a table showing which chapters have completed each stage
+(entity counts reflect the chapter view's transclusion references,
+including shared entities from earlier chapters):

 ```
 Available chapters (35):

 Chapter                        Entities     Mappings     Analysis
 ------------------------------ ------------ ------------ ------------
-book-1-chapter-01              done         done         done
-book-1-chapter-02              done         done         done
-book-1-chapter-03              done         done         done
-book-1-chapter-04              done         done         done
+book-1-chapter-01              done (13)    done         done
+book-1-chapter-02              done (7)     done         done
+book-1-chapter-03              done (18)    done         done
+book-1-chapter-04              done (5)     done         done
 book-1-chapter-05              -            -            -
 ...
+
+Canonical entity set: 41 unique entities
 ```

 ### Assessing metrics
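
Note on the hunk above: the per-chapter counts sum to 13 + 7 + 18 + 5 = 43 references, against 41 unique entities, which is consistent with the 2 duplicates mentioned in the commit message. A count of this kind could be derived from the chapter views roughly as follows. The regular expression assumes the directive form `{{ include "file.md" }}` quoted in the README; the function names are illustrative only.

```python
import re

# Assumed directive syntax, taken from the README text in this commit.
INCLUDE_RE = re.compile(r'\{\{\s*include\s+"([^"]+)"\s*\}\}')


def included_files(view_text: str) -> list[str]:
    """Return the file names referenced by include directives, in order."""
    return INCLUDE_RE.findall(view_text)


def summarise(views: dict[str, str]) -> tuple[dict[str, int], int]:
    """Return (references per chapter view, unique files across all views)."""
    counts = {chapter: len(included_files(text)) for chapter, text in views.items()}
    unique = {name for text in views.values() for name in included_files(text)}
    return counts, len(unique)
```

A shared entity is counted once per view that references it but only once in the unique total, which is why the column sum can exceed the canonical set size.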
@@ -335,10 +372,20 @@ Place your key in one of these locations (checked in order):
 with real content)
 3. It writes the compiled prompt to `output/<stage>/<chapter>-prompt.md`
 for inspection
-4. If an LLM adapter is configured and no output file exists yet, it
-**executes** the prompt and writes the result
-5. The output is **stored** as a generated artifact in the repository
-6. Dependency edges are **recorded** in the graph
+4. If no output exists yet and an LLM adapter is configured, it
+**executes** the prompt
+5. **For entity extraction (stage 1):** the pipeline first binds the
+list of already-existing entity slugs to `@{existing_entities}` so
+the LLM knows what to skip. The LLM returns combined content with
+`--- ENTITY: <name> ---` delimiters. The pipeline splits this into
+the **flat canonical directory** (`output/entities/<slug>.md`),
+skipping any entity whose slug already exists. It then generates the
+chapter view file with transclusion directives. The combined content
+is never persisted as a single file — canonical entity files are the
+source of truth.
+6. **For other stages:** the result is written directly to its output file
+7. The output is **stored** as a generated artifact in the repository
+8. Dependency edges are **recorded** in the graph

 ---

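
Note on the hunk above: the splitting described in step 5 could look roughly like the sketch below. It assumes each `--- ENTITY: <name> ---` marker sits on its own line and that any text before the first marker can be discarded; neither detail is confirmed by the diff, and `split_entities` is a hypothetical name.

```python
import re

# Assumption: each marker occupies a full line of the LLM response.
MARKER_RE = re.compile(r"^--- ENTITY: (.+?) ---[ \t]*$", re.MULTILINE)


def split_entities(combined: str) -> list[tuple[str, str]]:
    """Split a combined LLM response into (entity name, body) pairs.

    The body of an entity runs from the end of its marker line to the start
    of the next marker, or to the end of the text for the last entity.
    """
    matches = list(MARKER_RE.finditer(combined))
    entities: list[tuple[str, str]] = []
    for index, match in enumerate(matches):
        end = matches[index + 1].start() if index + 1 < len(matches) else len(combined)
        body = combined[match.end():end].strip()
        entities.append((match.group(1).strip(), body))
    return entities
```

Each pair would then be written to `output/entities/<slug>.md` only if that file does not exist yet, so the combined response itself never needs to be stored.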
@@ -347,7 +394,11 @@ Place your key in one of these locations (checked in order):
 Every processed chapter produces a git commit containing:

 - Compiled prompts (`*-prompt.md`) — so you can audit exactly what was sent
-- Generated outputs (`*-entities.md`, `*-mappings.md`, `*-analysis.md`)
+- Canonical entity files (`output/entities/<slug>.md`) — one file per entity,
+shared across chapters, first occurrence wins
+- Chapter entity views (`<chapter>-entities.md`) — transclusion into the
+canonical entities relevant to each chapter
+- Generated outputs (`*-mappings.md`, `*-analysis.md`)

 This means:

@@ -366,7 +417,8 @@ To commit manually after reviewing:

 ```bash
 python process_chapters.py --chapter book-1-chapter-05 --provider openrouter --no-commit
-# review output/entities/book-1-chapter-05-entities.md etc.
+# review new entity files in output/entities/ (look for recently modified .md files)
+# review chapter view in output/entities/book-1-chapter-05-entities.md
 git add examples/infospace-with-history/output/
 git commit -m "infospace: process book-1-chapter-05"
 ```
@@ -407,7 +459,9 @@ how to complete the rest.
 python process_chapters.py --book 1 --provider openrouter --no-commit
 ```

-Already-processed chapters are skipped (their output files exist).
+Already-processed chapters are skipped (their chapter view files exist).
+Entities from earlier chapters are automatically shared — the LLM is
+told which entities already exist and avoids re-extracting them.

 **2. Process Books II-V:**

@@ -478,32 +532,44 @@ Example: if metrics show that S3* (Audit) is consistently missed, you
 could add a paragraph to `extraction-rules.md` explicitly asking the LLM
 to look for audit, inspection, and oversight mechanisms.

-To re-process a specific chapter:
+To re-process a specific chapter, remove its chapter view and downstream
+outputs. Note: canonical entity files in `output/entities/` are shared
+across chapters — only delete individual entity files if you want them
+re-extracted from scratch.

 ```bash
-rm examples/infospace-with-history/output/entities/book-1-chapter-03-entities.md
-rm examples/infospace-with-history/output/mappings/book-1-chapter-03-mappings.md
-rm examples/infospace-with-history/output/analyses/book-1-chapter-03-analysis.md
+rm -f examples/infospace-with-history/output/entities/book-1-chapter-03-entities.md
+rm -f examples/infospace-with-history/output/mappings/book-1-chapter-03-mappings.md
+rm -f examples/infospace-with-history/output/analyses/book-1-chapter-03-analysis.md
 python process_chapters.py --chapter book-1-chapter-03 --provider openrouter --no-commit
 ```

+To also re-extract specific entities, delete their canonical files first:
+
+```bash
+rm -f examples/infospace-with-history/output/entities/extent-of-the-market.md
+# then re-process the chapter as above
+```
+
 ---

-## 12. Infrastructure Issues Found
+## 12. Infrastructure Issues Found and Fixed

 During development we documented three issues with the MarkiTect
 infrastructure in `INFRA-TASKS.md`:

-1. **Artifact repo doesn't store content** — the resolver returns
-placeholder text; the pipeline works around this with a local cache.
-2. **ContentMacro `raw_text` defaults to `""`** — causes silent data
-corruption when macros are constructed programmatically.
-3. **No `@{target}` syntax in TemplateAnalyzer** — macros must be
+1. **Artifact repo doesn't store content** — the resolver returned
+placeholder text instead of actual artifact content.
+2. **ContentMacro `raw_text` defaults to `""`** — caused silent data
+corruption when macros were constructed programmatically.
+3. **No `@{target}` syntax in MacroParser** — macros had to be
 constructed manually rather than auto-detected from template text.

-These are intentionally not fixed in this example (the constraint was
-"no changes to markitect infrastructure"). They are tracked for future
-improvement, after which the experiment can be re-run.
+All three have been fixed in the markitect infrastructure. The pipeline
+script (`process_chapters.py`) has been refactored to use the fixed
+infrastructure directly — the local content cache, manual macro
+construction, and manual substitution workarounds have been removed.
+See `INFRA-TASKS.md` for details on each fix.

 ---
