docs: metrics methodology, collection-level tasks, and infospace tooling roadmap

Add METRICS-METHODOLOGY.md documenting the theoretical frameworks (SEQUAL, OntoClean, OOPS!, OntoQA, FCA, DSL principles) adapted for two-layer evaluation (LLM-Eval + deterministic aggregation) across five collection concerns: redundancy, coverage, coherence, consistency, and granularity balance.

Extend INFRA-TASKS.md with assignment assessment (tasks 4-7), per-concept metrics (tasks 8-12), and collection-level metrics (tasks 13-19).

Add roadmap/infospace-tooling/PLAN.md defining terminology (infospace, topic, discipline, entity, evaluation, viability) and a three-stage implementation plan: Stage 1 platform additions, Stage 2 infospace tooling layer, Stage 3 example revision.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-18 23:53:21 +01:00
parent 2f0989f9bf
commit 4ce856d4d0
3 changed files with 1632 additions and 0 deletions
--- a/examples/infospace-with-history/INFRA-TASKS.md
+++ b/examples/infospace-with-history/INFRA-TASKS.md
@@ -37,3 +37,513 @@ no automatic parsing for this format, requiring manual macro construction.
 **Fix applied:** Added `SHORTHAND_PATTERN` to `MacroParser` that recognises
 `@{target}` and maps it to `MacroKind.REQUIRED`. Updated `has_macros()`,
 `count_macros()`, and `find_macro_positions()` accordingly.
+
+---
+
+## Assignment Assessment (18 Feb 2026)
+
+How the example measures against the objectives stated in `README.md`:
+
+| # | Objective | Status | Notes |
+|---|-----------|--------|-------|
+| 1 | Capture knowledge from Wealth of Nations | **Partial** | 7 of 35 chapters processed (Book I, ch. 1-7). 85 canonical entities extracted. |
+| 2 | Transform to VSM concepts/entities | **Done (for processed chapters)** | Entities mapped to S1-S5 with strength ratings. |
+| 3 | Consistent and complete | **Not yet** | Only 20% of chapters done. Metrics report exists but covers limited scope. |
+| 4 | Schemas as scaffolding | **Done** | Four schemas defined and used across all stages. |
+| 5 | Prompt dependency resolution | **Done** | `@{macro}` templates resolved via MultiSpaceResolutionStrategy. |
+| 6 | Incremental chapter injection | **Done** | Pipeline processes one chapter at a time; `@{existing_entities}` prevents duplication. |
+| 7 | Keep changes as git history | **Not done** | See task 4 below. |
+| 8 | Metrics for completeness/consistency | **Partial** | Template and report exist but only cover 4 chapters (report predates ch. 5-7). |
+| 9 | No infrastructure changes during experiment | **Violated** | Three infra fixes were required (tasks 1-3 above). Documented as intended. |
+| 10 | Generate task list for infra issues | **Done** | This file. |
+
+## 4. Infospace has no per-chapter git history — OPEN
+
+**Objective:** README states "The information space should utilize the option
+of keeping changes as git history."
+**Issue:** The 7 processed chapters were committed in mixed batches alongside
+infrastructure changes (LLM adapters, entity refactoring, archive policy).
+Chapters 1-2 are bundled into `fecc2fd` with the entire LLM module.
+Chapters 5-7 share a single commit (`41773f1`) with the OpenAI adapter and
+archive policy. There is no commit where you can `git diff` to see exactly
+what one chapter contributed to the infospace.
+**Impact:** Cannot use `git log`, `git diff`, or `git bisect` to trace how
+the infospace grew chapter by chapter — the core promise of "with history."
+**Suggested fix:** Re-run the 7 processed chapters (and remaining 28) using
+`process_chapters.py` without `--no-commit`, on a clean branch or after
+squashing the current output into a baseline commit. Each chapter gets its
+own commit via `_git_commit_chapter()`.
+
+## 5. Prompt files are regenerated as a side-effect of DB rebuild — OPEN
+
+**Issue:** Running `--all --no-commit` to regenerate `infospace.db` also
+overwrites `*-prompt.md` files in the output directories because each
+pipeline stage unconditionally writes the compiled prompt before checking
+whether output already exists. The `@{existing_entities}` macro content
+shifts as earlier chapters are loaded, so prompt files for already-processed
+chapters change on every full run.
+**Impact:** A DB regeneration dirties the working tree with prompt file
+changes, even though no actual outputs changed. Users must `git checkout`
+the prompt files after regeneration.
+**Suggested fix:** Skip writing prompt files when the corresponding output
+file already exists on disk, or add a `--rebuild-db-only` flag that
+populates the database without touching the file system.
+
+## 6. Metrics report is stale — OPEN
+
+**Issue:** The metrics report (`output/metrics/metrics-report.md`) was
+generated after chapters 1-4. Chapters 5-7 have since been processed but
+the report has not been refreshed.
+**Impact:** The metrics do not reflect the current state of the infospace.
+**Suggested fix:** Re-run `--metrics --provider <provider> --no-commit`
+after every batch of new chapters. Consider making metrics assessment
+automatic at the end of `--book` or `--all` runs.
+
+## 7. Remaining 28 chapters not yet processed — OPEN
+
+**Issue:** Only Book I chapters 1-7 have been processed. The rest of
+Book I and Books II-V (28 chapters in total) remain unprocessed.
+**Impact:** The infospace is incomplete — VSM coverage is limited to S1,
+S2, and partial S4. S3, S3*, S5, and many systemic concepts (algedonic
+signals, recursion, variety) are expected to emerge from later books.
+**Suggested fix:** Process remaining chapters in book-sized batches with
+per-chapter commits, refreshing metrics after each book.
+
+---
+
+## Per-Concept Metrics (tasks 8-12)
+
+The current metrics system is a single LLM-evaluated narrative report that
+assesses the infospace as a whole. It produces no machine-readable output,
+cannot be tracked over time, and conflates per-concept quality with
+collection-level coherence.
+
+The improvement splits metrics into two layers:
+
+- **LLM-Eval**: A prompt template evaluates each concept individually
+  against quality criteria defined in the schema. The LLM returns structured
+  scores, not prose.
+- **Deterministic aggregation**: `process_chapters.py` computes what it can
+  from files on disk (schema compliance, word counts, section presence,
+  coverage tallies) and aggregates LLM-eval scores into dashboard metrics.
+
+Both layers persist results in structured form so they can be diffed,
+tracked over time, and committed alongside the entities they evaluate.
+
+## 8. Add per-concept quality metrics to entity schema — OPEN
+
+**Issue:** The entity schema (`economic-entity-schema-v1.0.md`) defines
+required sections and validation rules (section presence, word count range)
+but no quality criteria. There is no definition of what makes a *good*
+entity versus a merely *compliant* one.
+**Suggested fix:** Add a `## Quality Metrics` section to the entity schema
+defining evaluation dimensions with scoring rubrics:
+
+- **Definition Precision** (1-5): Is the definition specific, non-circular,
+  and distinguishable from neighbouring concepts?
+- **Source Grounding** (1-5): Is the entity grounded in a specific passage?
+  Does the citation exist and support the definition?
+- **Domain Placement** (1-5): Is the economic domain assignment correct and
+  specific (not just "General Theory")?
+- **VSM Relevance** (1-5): Does the entity connect meaningfully to at least
+  one VSM system, or is it too granular/abstract to map?
+- **Explanatory Value** (1-5): Does this entity contribute to explaining
+  the economic system, or is it a restatement of another concept?
+
+Similarly update the VSM mapping schema with:
+
+- **Rationale Rigour** (1-5): Is the mapping justified with reference to
+  Beer's definitions, not just surface-level analogy?
+- **Strength Calibration** (1-5): Is the declared strength (Strong/Moderate/
+  Weak) consistent with the rationale given?
+
+These rubrics become the prompt instructions for task 9.
+
+## 9. Create evaluate-entity prompt template — OPEN
+
+**Depends on:** Task 8 (quality metrics in schema).
+**Issue:** There is no mechanism to evaluate an existing entity after
+extraction. Quality is only judged implicitly during the global metrics
+assessment, which is too coarse to identify individual weak entities.
+**Suggested fix:** Create `templates/evaluate-entity.md` — a prompt
+template that:
+
+1. Takes `@{entity_content}`, `@{source_chapter}`, `@{vsm_framework}`,
+   and `@{quality_rubric}` (from the schema's quality metrics section).
+2. Asks the LLM to score each dimension (1-5) with a one-sentence
+   justification per score.
+3. Outputs structured YAML front-matter (scores) followed by markdown
+   (justifications), e.g.:
+
+```yaml
+---
+entity: division-of-labour
+scores:
+  definition_precision: 5
+  source_grounding: 5
+  domain_placement: 4
+  vsm_relevance: 5
+  explanatory_value: 5
+overall: 4.8
+flags: []
+---
+```
+
+Add a pipeline stage: `--evaluate` runs this template against every
+canonical entity and writes results to `output/evaluations/<slug>-eval.md`.
+A `--evaluate --chapter <id>` variant evaluates only entities introduced
+by that chapter.
+
+## 10. Add deterministic schema compliance checker — OPEN
+
+**Issue:** Schema compliance is currently LLM-evaluated ("100%" in the
+metrics report) but the validation rules in the schemas are mechanical:
+section presence, word count ranges, heading format. These should be
+checked programmatically, not by an LLM.
+**Suggested fix:** Add a `validate_entity(path) -> ValidationResult`
+function to `process_chapters.py` (or a new `validate.py` module) that:
+
+- Parses the markdown to extract H2 section headings
+- Checks required sections are present (Definition, Source Chapter,
+  Context, Economic Domain)
+- Counts words in the Definition section (must be 20-150)
+- Checks H1 heading exists and is not a slug (e.g. `effectual-demand`
+  in chapter 7 has `# effectual-demand` instead of `# Effectual Demand`)
+- Validates Source Chapter cites a specific book/chapter
+- For mapping files: checks Mapping Strength is one of the enum values
+
+Expose as `--validate` CLI flag. Output a structured report:
+
+```
+Validation: 85 entities, 3 warnings
+  effectual-demand.md: H1 is slug format, not title case
+  porter.md: Definition is 18 words (minimum 20)
+  ...
+```
+
+This is fully deterministic — no LLM calls needed.
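
The checks listed for task 10 could look roughly like the sketch below. It is an illustration, not code from this commit: `ValidationResult`, `split_sections`, and `validate_entity_text` are hypothetical names, the function works on markdown text rather than a path, and only the H1, required-section, and Definition word-count rules are covered.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

# Required H2 sections, as listed in task 10.
REQUIRED_SECTIONS = ("Definition", "Source Chapter", "Context", "Economic Domain")
# A title that is all lowercase words joined by hyphens looks like a slug.
SLUG_RE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


@dataclass
class ValidationResult:
    warnings: list[str] = field(default_factory=list)


def split_sections(markdown: str) -> tuple[Optional[str], dict[str, str]]:
    """Return the H1 title and a mapping of H2 heading -> body text."""
    title: Optional[str] = None
    sections: dict[str, str] = {}
    current: Optional[str] = None
    for line in markdown.splitlines():
        if line.startswith("# ") and title is None:
            title = line[2:].strip()
        elif line.startswith("## "):
            current = line[3:].strip()
            sections[current] = ""
        elif current is not None:
            sections[current] += line + "\n"
    return title, sections


def validate_entity_text(markdown: str) -> ValidationResult:
    """Apply the mechanical schema rules to one entity file's content."""
    result = ValidationResult()
    title, sections = split_sections(markdown)
    if title is None:
        result.warnings.append("H1 heading is missing")
    elif SLUG_RE.match(title):
        result.warnings.append("H1 is slug format, not title case")
    for name in REQUIRED_SECTIONS:
        if name not in sections:
            result.warnings.append(f"Missing required section: {name}")
    if "Definition" in sections:
        words = len(sections["Definition"].split())
        if not 20 <= words <= 150:
            result.warnings.append(f"Definition is {words} words (expected 20-150)")
    return result
```

A path-based `validate_entity(path)` would then only need to read the file and delegate to `validate_entity_text`.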
+
+## 11. Structured metrics output format — OPEN
+
+**Depends on:** Tasks 9 and 10.
+**Issue:** The metrics report is a markdown narrative. Values cannot be
+parsed programmatically, diffed meaningfully, or plotted over time.
+**Suggested fix:** Alongside the human-readable `metrics-report.md`,
+emit a machine-readable `metrics.yaml` (or `.json`) containing:
+
+```yaml
+timestamp: "2026-02-18T12:00:00Z"
+chapters_processed: 7
+chapters_total: 35
+entities_total: 85
+entities_archived: 0
+vsm_coverage:
+  S1: 28
+  S2: 12
+  S3: 8
+  S3_star: 0
+  S4: 5
+  S5: 0
+  recursion: 1
+  variety: 0
+mapping_strength:
+  strong: 64
+  moderate: 18
+  weak: 3
+validation:
+  schema_compliant: 82
+  warnings: 3
+evaluation:    # from LLM-eval (task 9)
+  mean_overall: 4.2
+  min_overall: 2.8
+  flagged_entities: ["porter", "country-workman"]
+```
+
+The `--metrics` command writes both files. The YAML file is committed
+to git so `git diff` shows exactly how metrics changed between runs.
+
+## 12. Metrics-over-time tracking — OPEN
+
+**Depends on:** Task 11 (structured output).
+**Issue:** There is one metrics snapshot that gets overwritten. No history
+of how metrics evolved as chapters were added.
+**Suggested fix:** Append each metrics snapshot to a cumulative log file
+`output/metrics/metrics-history.yaml` (list of timestamped entries). This
+is committed to git alongside the current snapshot. The pipeline can
+optionally render a simple text-based progress summary:
+
+```
+Metrics history (5 snapshots):
+  2026-02-10  ch 1/35   13 entities  41.7% VSM coverage
+  2026-02-11  ch 4/35   38 entities  50.0% VSM coverage
+  2026-02-11  ch 7/35   85 entities  58.3% VSM coverage
+  ...
+```
+
+This provides the "metrics that improve over time" feedback loop the
+README envisions: process chapters → evaluate → see coverage grow (or
+flag regressions when a re-extraction reduces quality scores).
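
As an illustration of the regression flagging described for task 12, the sketch below works on a list of already-loaded snapshot dicts shaped like the `metrics.yaml` example in task 11. The function names are hypothetical, and the coverage percentage is assumed to be the share of VSM categories with at least one entity; the tasks do not define how the percentage is computed.

```python
def vsm_coverage_pct(snapshot: dict) -> float:
    """Share of VSM categories with at least one mapped entity (assumed definition)."""
    counts = snapshot["vsm_coverage"]
    populated = sum(1 for n in counts.values() if n > 0)
    return 100.0 * populated / len(counts)


def find_regressions(history: list[dict], key: str = "mean_overall") -> list[tuple]:
    """Return (earlier_ts, later_ts, drop) for each consecutive drop in evaluation[key]."""
    regressions = []
    for prev, cur in zip(history, history[1:]):
        drop = prev["evaluation"][key] - cur["evaluation"][key]
        if drop > 0:
            regressions.append((prev["timestamp"], cur["timestamp"], round(drop, 3)))
    return regressions
```

Loading `metrics-history.yaml` itself would need a YAML parser, which is left out here to keep the sketch free of third-party dependencies.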
+
+---
+
+## Collection-Level Metrics (tasks 13-19)
+
+These tasks implement the five collection-level concerns described in
+`METRICS-METHODOLOGY.md`. They share underlying infrastructure (entity
+metadata index, definition embeddings, relationship graph) that should
+be built once per evaluation run.
+
+See the methodology document for theoretical grounding, framework
+references, and the full metric definitions per concern.
+
+## 13. Entity metadata index — deterministic parsing layer — OPEN
+
+**Depends on:** Task 10 (schema compliance checker shares parsing logic).
+**Issue:** Several collection-level metrics (coverage matrix, FCA context,
+granularity distribution) require structured metadata extracted from entity
+files: H1 title, economic domain, VSM system(s), source chapter, section
+presence, word counts. Currently this information exists only as prose
+inside markdown files.
+**Suggested fix:** Add a `parse_entity_metadata(path) -> EntityMeta`
+function that extracts from each entity file:
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass
+class EntityMeta:
+    slug: str
+    title: str                  # from H1
+    domain: str                 # from Economic Domain section
+    source_chapter: str         # from Source Chapter section
+    definition_words: int       # word count of Definition section
+    has_original_wording: bool  # optional section present?
+    has_modern_interpretation: bool
+    vsm_systems: list[str]      # from mapping file, if one exists
+    mapping_strengths: list[str]
+```
+
+Build an index of all entities at the start of each evaluation run.
+This index is the input for tasks 14, 15, 16, and 18. Expose as
+`--index` CLI flag for inspection.
+
+## 14. Redundancy detection (Concern C1) — OPEN
+
+**Depends on:** Task 13 (metadata index).
+**Methodology:** OOPS! P2 (synonymous classes) + embedding similarity +
+LLM pairwise judgment. See METRICS-METHODOLOGY.md §4 C1.
+**Issue:** Entities with different slugs but overlapping meanings (e.g.
+`natural-rate` / `ordinary-or-average-rate`) survive extraction because
+dedup only checks slug collisions. There is no semantic overlap detection.
+**Suggested fix:** Implement in three stages:
+
+1. **Embed** — Compute vector embeddings of all entity definitions using
+   an embedding API (OpenRouter, OpenAI, or a local sentence-transformer).
+   Cache embeddings in `output/metrics/embeddings.json` keyed by
+   `{slug: content_digest}` so unchanged entities skip re-embedding.
+
+2. **Similarity matrix** — Compute NxN cosine similarity. Write the full
+   matrix to `output/metrics/similarity-matrix.json`. Flag all pairs with
+   cosine > 0.80 as candidates.
+
+3. **LLM pairwise judgment** — For each candidate pair, run a prompt:
+   "Given these two entity definitions, are they (a) the same concept and
+   should be merged, (b) genuinely distinct, or (c) partially overlapping
+   and should be clarified?" Write results to
+   `output/metrics/redundancy-report.md` + YAML.
+
+**Metrics produced:**
+- `high_similarity_pairs`: count and list
+- `confirmed_synonyms`: count (LLM-confirmed same concept)
+- `redundancy_ratio`: `confirmed_synonyms / total_entities`
+- `intensional_conciseness`: `1 - redundancy_ratio`
+
+**CLI:** `--check-redundancy --provider <provider>`
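
The similarity-matrix stage and the two ratio metrics for task 14 can be sketched in plain Python as follows. This is a minimal illustration under assumptions: embeddings are already available as a slug-to-vector dict, and `cosine`, `candidate_pairs`, and `redundancy_metrics` are hypothetical helper names. A real run over many entities would more likely use a vectorised library.

```python
import math
from itertools import combinations


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def candidate_pairs(embeddings: dict[str, list[float]], threshold: float = 0.80):
    """Return (slug_a, slug_b, similarity) for every pair above the threshold."""
    pairs = []
    for (slug_a, vec_a), (slug_b, vec_b) in combinations(sorted(embeddings.items()), 2):
        similarity = cosine(vec_a, vec_b)
        if similarity > threshold:
            pairs.append((slug_a, slug_b, similarity))
    # Most similar pairs first, so the LLM judgment stage sees them first.
    return sorted(pairs, key=lambda p: -p[2])


def redundancy_metrics(confirmed_synonyms: int, total_entities: int) -> dict:
    """redundancy_ratio and intensional_conciseness as defined in task 14."""
    ratio = confirmed_synonyms / total_entities if total_entities else 0.0
    return {"redundancy_ratio": ratio, "intensional_conciseness": 1 - ratio}
```

Only the pairs returned by `candidate_pairs` go to the LLM pairwise judgment, so the number of LLM calls depends on the 0.80 threshold rather than on all N×N pairs.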
+
+## 15. Coverage completeness (Concern C2) — OPEN
+
+**Depends on:** Task 13 (metadata index).
+**Methodology:** SEQUAL completeness + FCA gap analysis + DSL competency
+questions. See METRICS-METHODOLOGY.md §4 C2.
+**Issue:** Coverage is currently assessed by the LLM in a single narrative
+pass. There is no structured view of which domain × VSM cells are
+populated, and no way to test whether the entity set can answer specific
+questions about the economic system.
+**Suggested fix:** Implement in three stages:
+
+1. **Domain × VSM matrix** — From the metadata index, count entities per
+   {economic_domain, vsm_system} cell. Render as a table. Identify empty
+   cells as specific, actionable gaps. Compute:
+   - `coverage_ratio = populated_cells / total_cells`
+   - `vsm_balance_entropy = -Σ(pᵢ log pᵢ)` across VSM systems
+
+2. **FCA lattice** — Construct a formal context with objects = entities,
+   attributes = {domain, vsm_system, source_book, abstraction_level}.
+   Compute the concept lattice (Python `concepts` library). Extract
+   attribute combinations with no corresponding entity — these are
+   **structural coverage gaps** not visible in the simple matrix.
+
+3. **Competency questions** — Define a set of 15-20 canonical questions
+   the infospace should answer (stored in
+   `schemas/competency-questions.md`). Example questions:
+   - "How does the division of labour relate to market extent?"
+   - "What mechanisms regulate wages toward their natural rate?"
+   - "How do monopolies distort the viable system?"
+   LLM-Eval tests whether current entities suffice to answer each.
+   Unanswerable questions identify specific completeness gaps.
+
+**Metrics produced:**
+- `domain_vsm_matrix`: cell counts
+- `coverage_ratio`: scalar
+- `vsm_balance_entropy`: scalar
+- `empty_cells`: list of {domain, vsm_system} gaps
+- `fca_gap_concepts`: attribute combos with no entity
+- `competency_coverage`: fraction of questions answerable
+
+**CLI:** `--check-coverage --provider <provider>`
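
The deterministic part of stage 1 (the domain × VSM matrix) might be computed along these lines. The sketch assumes one `(domain, vsm_system)` pair per entity mapping and uses the natural logarithm for the entropy, since the task does not fix a base; `coverage_metrics` is a hypothetical name.

```python
import math
from collections import Counter


def coverage_metrics(pairs, domains, vsm_systems) -> dict:
    """Compute matrix-based coverage metrics.

    pairs: iterable of (domain, vsm_system), one per entity mapping.
    domains / vsm_systems: the full axes of the matrix, including empty ones.
    """
    cells = Counter(pairs)
    all_cells = [(d, s) for d in domains for s in vsm_systems]
    empty = [cell for cell in all_cells if cells[cell] == 0]

    # Distribution of mappings across VSM systems, ignoring the domain axis.
    per_system = Counter(system for _, system in cells.elements())
    total = sum(per_system.values())
    entropy = 0.0
    if total:
        entropy = -sum((n / total) * math.log(n / total) for n in per_system.values())

    return {
        "coverage_ratio": (len(all_cells) - len(empty)) / len(all_cells),
        "vsm_balance_entropy": entropy,
        "empty_cells": empty,
    }
```

The entropy is highest when mappings are spread evenly across VSM systems, so a low value points to a collection dominated by one or two systems.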
+
+## 16. Structural coherence (Concern C3) — OPEN
+
+**Depends on:** Task 13 (metadata index).
+**Methodology:** OntoQA relationship richness + graph connectivity +
+community detection. See METRICS-METHODOLOGY.md §4 C3.
+**Issue:** It is unknown whether the 85 entities form a connected
+explanatory web or a fragmented collection. No relationship graph exists
+between entities.
+**Suggested fix:** Implement in three stages:
+
+1. **Explicit cross-references** — Scan each entity's definition for
+   mentions of other entity slugs or titles (normalised string matching).
+   This is deterministic and catches direct references.
+
+2. **LLM-inferred edges** — For entity pairs not caught by string
+   matching but in the same domain or VSM system, LLM-Eval: "Does A's
+   definition conceptually depend on or explain B, or vice versa?" Run
+   in batches. Write the combined graph to
+   `output/metrics/relationship-graph.json` (adjacency list).
+
+3. **Graph analysis** — Using networkx or equivalent:
+   - Connected components (target: 1)
+   - Graph density, average degree
+   - Betweenness centrality → identify bridge concepts
+   - Louvain community detection → compare to declared domains
+   - OntoQA Relationship Richness
+   - Cohesion per domain, coupling across domains
+   - Orphan entities (degree 0 or 1)
+
+**Metrics produced:**
+- `connected_components`: count (target: 1)
+- `graph_density`: scalar
+- `avg_degree`: scalar
+- `relationship_richness`: OntoQA RR
+- `modularity`: Louvain score
+- `bridge_concepts`: list (high betweenness centrality)
+- `orphan_entities`: list (degree ≤ 1)
+- `cohesion_by_domain` / `coupling_across_domains`: scalars
+
+**CLI:** `--check-coherence --provider <provider>`
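
A dependency-free sketch of the simpler graph metrics in stage 3 is shown below. The task suggests networkx, which also covers betweenness centrality and community detection; this version only illustrates components, density, average degree, and orphans on an undirected graph. `graph_metrics` is a hypothetical name, and every edge endpoint is assumed to appear in `nodes`.

```python
def graph_metrics(nodes: list[str], edges: list[tuple[str, str]]) -> dict:
    """Basic coherence metrics for an undirected entity graph."""
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        if a != b:  # ignore self-references
            adjacency[a].add(b)
            adjacency[b].add(a)

    # Count connected components with an iterative depth-first search.
    seen: set[str] = set()
    components = 0
    for start in adjacency:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbour in adjacency[node] - seen:
                seen.add(neighbour)
                stack.append(neighbour)

    n = len(adjacency)
    m = sum(len(neighbours) for neighbours in adjacency.values()) // 2
    return {
        "connected_components": components,
        "graph_density": 2 * m / (n * (n - 1)) if n > 1 else 0.0,
        "avg_degree": 2 * m / n if n else 0.0,
        "orphan_entities": sorted(k for k, v in adjacency.items() if len(v) <= 1),
    }
```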
+
+## 17. Definitional consistency (Concern C4) — OPEN
+
+**Depends on:** Task 16 (relationship graph — the definitional dependency
+graph is a directed variant of the same structure).
+**Methodology:** OntoClean metaproperties + OOPS! P24 (circular
+definitions) + SEQUAL validity. See METRICS-METHODOLOGY.md §4 C4.
+**Issue:** No mechanism to detect circular definitions, contradictions
+between related entities, or terms used in definitions that should be
+entities but aren't.
+**Suggested fix:** Implement in four stages:
+
+1. **Definitional dependency graph** — Directed version of the
+   relationship graph: edge A→B means A's definition uses B's concept.
+   Reuse cross-reference extraction from task 16.
+
+2. **Cycle detection** — Find all cycles of length ≤ 3 in the directed
+   graph. Short cycles are problematic (A defines B, B defines A).
+   Compute `grounding_ratio`: fraction of entities traceable to terms
+   outside the entity set without encountering a cycle.
+
+3. **Undefined dependencies** — Extract terms from definitions that match
+   entity-name patterns (capitalised noun phrases, kebab-case slugs) but
+   have no corresponding entity file. These are concepts the infospace
+   implicitly relies on but hasn't defined.
+
+4. **LLM consistency checks** — For directly-connected entity pairs,
+   LLM-Eval: "Do these definitions contradict each other?" For entities
+   with Smith's Original Wording, LLM-Eval: "Does the definition
+   accurately represent the cited passage?"
+
+**Metrics produced:**
+- `circular_definitions`: count and list of cycles (length ≤ 3)
+- `grounding_ratio`: fraction of entities reaching primitives
+- `undefined_dependencies`: list of missing terms
+- `contradiction_candidates`: LLM-flagged pairs
+- `source_fidelity_score`: fraction passing source check
+
+**CLI:** `--check-consistency --provider <provider>`
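
Stage 2 (cycle detection) can be illustrated with a bounded depth-first search over the directed dependency graph. This is a sketch under assumptions: `deps` maps each slug to the set of slugs its definition uses, and `short_cycles` is a hypothetical name. Each cycle is reported once, starting from its alphabetically smallest slug; `grounding_ratio` is not covered.

```python
def short_cycles(deps: dict[str, set[str]], max_len: int = 3) -> list[tuple[str, ...]]:
    """Return all directed cycles of length <= max_len.

    deps[a] contains b when a's definition uses b's concept (edge a -> b).
    """
    found: set[tuple[str, ...]] = set()

    def walk(start: str, node: str, path: list[str]) -> None:
        for nxt in deps.get(node, ()):
            if nxt == start:
                found.add(tuple(path))  # closed a cycle back to the start
            elif nxt > start and nxt not in path and len(path) < max_len:
                # Only visit slugs larger than the start, so each cycle is
                # discovered exactly once, from its smallest member.
                walk(start, nxt, path + [nxt])

    for start in deps:
        walk(start, start, [start])
    return sorted(found)
```

With `max_len=3` the search depth is bounded, so the cost stays modest even when many entities reference each other.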
+
+## 18. Granularity balance (Concern C5) — OPEN
+
+**Depends on:** Task 13 (metadata index).
+**Methodology:** Keet granularity theory + OntoClean rigidity +
+DSL laconicity. See METRICS-METHODOLOGY.md §4 C5.
+**Issue:** Entities range from broad sectors (`agriculture`) to specific
+market roles (`effectual-demanders`) to abstract principles
+(`division-of-labour`). It is unclear whether this range is appropriate
+or whether some entities are too specific/general relative to their peers.
+**Suggested fix:** Implement in three stages:
+
+1. **LLM classification** — For each entity, LLM-Eval assigns:
+   - Abstraction level: `theory` / `mechanism` / `observation`
+   - Scope score: 1-5 (very specific → very general)
+   - Indispensability: 1-5 ("if removed, how much explanatory power lost?")
+   Write to `output/evaluations/<slug>-classification.yaml`.
+
+2. **Distribution analysis** — Deterministic:
+   - Count per abstraction level; compute entropy
+   - Per-domain scope variance (flag domains with high variance)
+   - Level × domain matrix (from FCA context in task 15)
+   - Outlier detection: entities > 1.5σ from their domain's mean scope
+
+3. **Merge/split recommendations** — For outlier entities, LLM-Eval:
+   "Should this entity be merged into a broader concept, split into
+   sub-concepts, or is its current granularity justified?" For entities
+   with indispensability ≤ 2: "Could another entity serve this purpose?"
+
+**Metrics produced:**
+- `abstraction_distribution`: {theory: n, mechanism: n, observation: n}
+- `abstraction_entropy`: scalar (higher = more balanced)
+- `scope_variance_by_domain`: per-domain scalar
+- `dispensable_entities`: list (indispensability ≤ 2)
+- `merge_candidates`: list of pairs
+- `split_candidates`: list of entities
+
+**CLI:** `--check-granularity --provider <provider>`
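
The 1.5σ outlier rule from stage 2 might be implemented roughly as follows. The sketch assumes the LLM scope scores are already collected as a slug-to-(domain, score) mapping and uses the population standard deviation, which the task does not specify; `scope_outliers` is a hypothetical name.

```python
from collections import defaultdict
from statistics import mean, pstdev


def scope_outliers(scopes: dict[str, tuple[str, int]], k: float = 1.5) -> list[str]:
    """Return slugs whose scope score is more than k sigma from their domain mean.

    scopes maps slug -> (domain, scope score on the 1-5 scale).
    """
    by_domain: dict[str, list[tuple[str, int]]] = defaultdict(list)
    for slug, (domain, score) in scopes.items():
        by_domain[domain].append((slug, score))

    outliers: list[str] = []
    for items in by_domain.values():
        values = [score for _, score in items]
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            continue  # every entity in the domain has the same scope
        outliers += [slug for slug, score in items if abs(score - mu) > k * sigma]
    return sorted(outliers)
```

Domains with very few entities give unstable estimates of the mean and spread, so a minimum domain size may be worth adding before trusting the flags.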
+
+## 19. Unified collection evaluation command — OPEN
+
+**Depends on:** Tasks 13-18.
+**Issue:** Running five separate `--check-*` commands is cumbersome and
+repeats shared computation (metadata parsing, embedding, graph building).
+**Suggested fix:** Add `--evaluate-collection --provider <provider>` that
+runs all five checks in sequence, sharing infrastructure:
+
+1. Parse entity metadata index (task 13) — used by all
+2. Compute embeddings (task 14) — used by C1, C3
+3. Build relationship graph (task 16) — used by C3, C4
+4. Run all five concern checks
+5. Write per-concern reports to `output/metrics/`
+6. Write unified `metrics.yaml` with all collection metrics
+7. Append to `metrics-history.yaml` (task 12)
+
+Incremental mode: `--evaluate-collection --chapter <id>` re-evaluates
+only entities from that chapter plus pairwise checks involving them.
+
+Report a summary to stdout:
+
+```
+Collection evaluation (85 entities, 7 chapters):
+  Redundancy:   3 synonym candidates, conciseness 0.96
+  Coverage:     58% VSM, 20% chapters, 4 domain gaps
+  Coherence:    1 component, density 0.12, 2 orphans
+  Consistency:  0 cycles, 5 undefined deps, 0 contradictions
+  Granularity:  entropy 1.42, 1 dispensable, 2 merge candidates
+```
--- a/examples/infospace-with-history/METRICS-METHODOLOGY.md
+++ b/examples/infospace-with-history/METRICS-METHODOLOGY.md
@@ -0,0 +1,501 @@
+# Collection-Level Metrics Methodology
+
+How we evaluate the quality of the infospace as a **collection of
+interrelated concepts**, beyond the quality of individual entities.
+
+This document describes the theoretical frameworks it draws on (ontology
+engineering, formal concept analysis, semiotic quality theory, and DSL
+design) and how each is adapted to work within MarkiTect's two-layer
+evaluation model (LLM-Eval + deterministic aggregation).
+
+---
+
+## 1. The Two-Layer Model
+
+Every metric in this methodology decomposes into two layers:
+
+| Layer | What it does | How it runs |
+|-------|-------------|-------------|
+| **LLM-Eval** | Qualitative judgment: "Are these two concepts the same?", "Is this definition grounded in the source?" | Prompt template → LLM → structured YAML output |
+| **Deterministic** | Quantitative aggregation: cosine similarity, graph connectivity, coverage counting, cycle detection | Python code in `process_chapters.py` or dedicated `metrics.py` |
+
+The LLM-Eval layer produces **per-entity** or **per-pair** structured
+scores. The deterministic layer **aggregates** these into collection-level
+metrics, persisted as machine-readable YAML alongside human-readable
+markdown reports.
+
+Per-concept quality metrics (definition precision, source grounding, VSM
+relevance — see INFRA-TASKS 8-12) operate at the individual entity level.
+This document covers the five **collection-level concerns** that assess how
+the entities work together as an explanatory system.
+
+---
+
+## 2. Five Collection-Level Concerns
+
+### Overview
+
+| # | Concern | Question | Primary framework |
+|---|---------|----------|-------------------|
+| C1 | Semantic Overlap | Are there redundant concepts? | OOPS! P2, embedding similarity |
+| C2 | Coverage Completeness | Does the concept set cover the domain? | SEQUAL, FCA |
+| C3 | Structural Coherence | Do concepts form a connected explanatory graph? | OntoQA, graph theory |
+| C4 | Definitional Consistency | Are concepts defined consistently and non-circularly? | OntoClean, OOPS! P24 |
+| C5 | Granularity Balance | Are concepts at comparable levels of abstraction? | Granularity theory, DSL laconicity |
+
+---
+
+## 3. Theoretical Frameworks
+
+### 3.1 SEQUAL (Semiotic Quality Framework)
+
+**Origin:** Lindland, Sindre & Sølvberg (1994), extended by Krogstie et al.
+
+**What it defines:** Quality of a conceptual model as the correspondence
+between three worlds — the domain (what exists), the model (what we
+captured), and the audience's interpretation (what they understand).
+
+Two key dimensions of **semantic quality**:
+
+- **Validity** — everything in the model corresponds to something real
+  in the domain. No invented concepts.
+- **Completeness** — everything relevant in the domain is represented in
+  the model. No missing concepts.
+
+**How we use it:** SEQUAL frames our entire metrics approach. Every
+collection-level metric maps to one or both of these dimensions:
+
+| SEQUAL dimension | Our concerns |
+|-----------------|--------------|
+| Validity | C1 (redundancy reduces validity — duplicate concepts don't correspond to distinct domain facts), C4 (consistency — contradictory definitions can't both be valid) |
+| Completeness | C2 (coverage — are all needed concepts present?), C5 (granularity — missing levels of abstraction are completeness gaps) |
+| Both | C3 (coherence — disconnected concepts suggest either missing bridging concepts [completeness] or misplaced concepts [validity]) |
+
+**Adaptation:** SEQUAL was designed for formal models evaluated by human
+experts. We replace human judgment with LLM-Eval (for validity checks like
+"does this concept correspond to something Smith actually described?") and
+deterministic counting (for completeness checks like "which VSM systems
+lack entity mappings?").
+
+### 3.2 OntoClean
+
+**Origin:** Guarino & Welty (2004).
+
+**What it defines:** A methodology for validating taxonomic relationships
+by assigning **metaproperties** to each concept:
+
+- **Rigidity** — Is the property essential to all its instances? (e.g.
+  "market" is rigid; "effectual demander" is anti-rigid — an agent can
+  stop being an effectual demander)
+- **Identity** — Does the concept carry an identity criterion? (e.g.
+  "division of labour" can be identified by its three causal mechanisms)
+- **Unity** — Are all instances of this concept whole in the same way?
+- **Dependence** — Does the concept require another concept to exist?
+  (e.g. "market price" depends on "effectual demand")
+
+**Constraint:** A rigid concept cannot be subsumed by an anti-rigid one.
+Violations indicate structural confusion.
+
+**How we use it:** We do not have a formal taxonomy, but our flat entity
+set implicitly contains subsumption relationships (e.g. "natural rate"
+subsumes "ordinary-or-average rate"). OntoClean metaproperties help detect:
+
+- **Granularity mismatches** (C5): A rigid concept at the same level as
+  an anti-rigid one suggests different abstraction levels are mixed.
+- **Definitional consistency** (C4): If entity A depends on entity B per
+  OntoClean, but B's definition doesn't acknowledge A, the definitions
+  are inconsistent.
+- **Redundancy** (C1): Two entities with identical metaproperty profiles
+  and overlapping definitions are candidates for merging.
+
+**Adaptation:** Instead of manual metaproperty assignment, we use LLM-Eval
+to classify each entity's rigidity, identity criterion, and dependencies.
+The constraint checking is then deterministic.
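
Once rigidity labels exist, the deterministic constraint check reduces to a filter over subsumption pairs. The sketch below is illustrative only: the label strings, the `(parent, child)` pair format, and the `rigidity_violations` name are assumptions, not part of the methodology.

```python
def rigidity_violations(
    rigidity: dict[str, str],
    subsumptions: list[tuple[str, str]],
) -> list[tuple[str, str]]:
    """Return (parent, child) pairs where an anti-rigid parent subsumes a rigid child.

    rigidity maps slug -> "rigid" or "anti-rigid" (labels assigned by LLM-Eval).
    subsumptions lists (parent, child) pairs, meaning parent subsumes child.
    """
    return [
        (parent, child)
        for parent, child in subsumptions
        if rigidity.get(parent) == "anti-rigid" and rigidity.get(child) == "rigid"
    ]
```

Entities without a rigidity label are skipped rather than flagged, so missing classifications do not produce false violations.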
+
+### 3.3 OOPS! (Ontology Pitfall Scanner)
+
+**Origin:** Poveda-Villalón et al. (2014). Catalogue of 41 common
+ontology design pitfalls.
+
+**What it defines:** Concrete, testable anti-patterns. The pitfalls most
+relevant to our infospace:
+
+| Pitfall | Description | Our concern |
+|---------|-------------|-------------|
+| P2 | Synonymous classes: different names, same meaning | C1 (redundancy) |
+| P4 | Unconnected ontology elements | C3 (coherence) |
+| P7 | Merging different concepts in the same class | C5 (granularity: too coarse) |
+| P10 | Missing disjointness axioms | C1 (how do we know two concepts don't overlap?) |
+| P11 | Missing domain or range | C4 (consistency) |
+| P13 | Missing inverse relationships | C3 |
+| P24 | Recursive/circular definition | C4 (consistency) |
+| P25 | Inverse of itself | C4 |
+
+**How we use it:** OOPS! pitfalls become a **checklist for LLM-Eval
+prompts**. Rather than running a formal OWL scanner, we ask the LLM to
+check for each pitfall pattern:
+
+- "Are entities A and B synonymous?" (P2)
+- "Does entity A's definition reference itself?" (P24)
+- "Is entity A actually two distinct concepts merged together?" (P7)
+
+The deterministic layer counts pitfall occurrences and tracks them over
+time.
+
+**Adaptation:** We select the subset of OOPS! pitfalls applicable to
+semi-formal markdown-based ontologies (no OWL axioms) and implement each
+as an LLM-Eval prompt pattern rather than a formal reasoner check.
+
+### 3.4 OntoQA (Metric-Based Ontology Quality Analysis)
+
+**Origin:** Tartir & Arpinar (2007).
+
+**What it defines:** Quantitative schema-level and instance-level metrics:
+
+- **Relationship Richness (RR):** Proportion of non-taxonomic (lateral)
+  relationships to total relationships. `RR = non_hierarchical / total`.
+  Low RR = mere taxonomy. High RR = rich cross-cutting connections.
+- **Attribute Richness (AR):** Average number of attributes per concept.
+  `AR = total_attributes / total_concepts`.
+- **Inheritance Richness (IR):** Average subclasses per class — measures
+  how knowledge distributes across the hierarchy.
+- **Class Richness (CR):** Proportion of classes with instances.
+
+**How we use it:** Our entities don't have formal relationships declared
+between them, but we can **infer** a relationship graph from their
+definitions and mappings:
+
+- Entity A references entity B in its definition → definitional dependency
+- Entities A and B map to the same VSM system → structural co-occurrence
+- Entities A and B appear in the same chapter → contextual co-occurrence
+
+From this inferred graph, together with the entity metadata, we compute
+OntoQA metrics directly:
+
+- **Relationship Richness** tells us whether our concepts form a web of
+  explanatory connections or just a flat list.
+- **Attribute Richness** maps to our schema sections — entities with more
+  optional sections filled (Original Wording, Modern Interpretation) are
+  richer.
+
+**Adaptation:** The key modification is that relationship inference is an
+LLM-Eval step (pairwise: "does A's definition depend on or reference B?"),
+after which all OntoQA metrics are computed deterministically on the
+resulting graph.
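+
+A minimal sketch of the deterministic half, assuming the inferred edges
+arrive as `(source, target, kind)` tuples and that taxonomic edges carry
+the kind `"is_a"`. The edge kinds, slugs, and section counts are
+illustrative placeholders, not pipeline output:
+
+```python
+def relationship_richness(edges, hierarchical_kinds=frozenset({"is_a"})):
+    """OntoQA RR: share of edges that are not taxonomic."""
+    if not edges:
+        return 0.0
+    lateral = sum(1 for _, _, kind in edges if kind not in hierarchical_kinds)
+    return lateral / len(edges)
+
+
+def attribute_richness(sections_per_entity):
+    """OntoQA AR: average number of filled schema sections per entity."""
+    if not sections_per_entity:
+        return 0.0
+    return sum(sections_per_entity.values()) / len(sections_per_entity)
+
+
+# Hypothetical inferred edges: (source, target, kind)
+edges = [
+    ("division-of-labour", "market-extent", "definitional"),
+    ("natural-price", "market-price", "definitional"),
+    ("wages", "natural-price", "is_a"),
+    ("wages", "labour", "co_occurrence"),
+]
+# Hypothetical count of filled sections per entity
+sections = {"division-of-labour": 4, "market-extent": 2, "wages": 3}
+
+rr = relationship_richness(edges)
+ar = attribute_richness(sections)
+```
+
+Keeping the edge kind on each tuple lets the same edge list feed both RR
+and the graph metrics under C3.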
+
+### 3.5 Formal Concept Analysis (FCA)
+
+**Origin:** Wille (1982). Applied to ontology auditing by Elhaj et al.
+(2008) for SNOMED CT completeness checking.
+
+**What it defines:** A mathematical framework for deriving a **concept
+lattice** from a binary relation between objects and attributes. The
+lattice reveals:
+
+- **Formal concepts**: pairs of a maximal set of objects (the extent) and
+  the full set of attributes those objects share (the intent)
+- **Subconcept/superconcept** relationships: the natural hierarchy
+- **Missing concepts**: attribute combinations with no corresponding object
+
+**How we use it:** We construct a **formal context** (binary matrix):
+
+- **Objects** = our 85 entities
+- **Attributes** = economic domain, VSM system, source book, abstraction
+  level (from LLM-Eval), key terms (extracted from definitions)
+
+The concept lattice then reveals:
+
+- **Coverage gaps** (C2): Attribute combinations with no entity. E.g. if
+  the cell {Distribution, S3} is empty, we lack control-layer concepts
+  for distribution — a specific, actionable gap.
+- **Redundancy** (C1): Entities with identical attribute sets (same formal
+  concept) are candidates for merging.
+- **Granularity** (C5): The lattice depth indicates how many meaningful
+  levels of abstraction exist. A shallow lattice suggests missing
+  intermediate concepts.
+
+**Adaptation:** Classic FCA requires crisp binary attributes. Our domains
+and VSM mappings are already categorical, but abstraction level and key
+terms need LLM-Eval to produce. The lattice computation itself is
+deterministic (Python `concepts` library or equivalent). The FCA approach
+replaces the current "ask the LLM about coverage" with a structural
+computation that can identify *specific* gaps rather than vague
+recommendations.
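+
+The sketch below illustrates the idea on a toy context, assuming each
+entity is described by a set of crisp attributes. It enumerates formal
+concepts by brute force in pure Python rather than through the `concepts`
+library, so it only suits very small contexts; the slugs and attribute
+values are made up for illustration:
+
+```python
+from itertools import combinations
+
+
+def intent(objects, context, all_attributes):
+    """Attributes shared by every object (all attributes for the empty set)."""
+    shared = set(all_attributes)
+    for obj in objects:
+        shared &= context[obj]
+    return frozenset(shared)
+
+
+def extent(attributes, context):
+    """Objects that carry every attribute in the given set."""
+    return frozenset(o for o, attrs in context.items() if attributes <= attrs)
+
+
+def formal_concepts(context):
+    """Enumerate (extent, intent) pairs by closing every object subset."""
+    all_attributes = set().union(*context.values())
+    found = set()
+    objs = list(context)
+    for size in range(len(objs) + 1):
+        for subset in combinations(objs, size):
+            closed_intent = intent(subset, context, all_attributes)
+            found.add((extent(closed_intent, context), closed_intent))
+    return found
+
+
+# Hypothetical formal context: entity -> attribute set
+context = {
+    "division-of-labour": {"Production", "S1"},
+    "market-extent": {"Exchange", "S2"},
+    "natural-price": {"Exchange", "S2"},
+}
+lattice = formal_concepts(context)
+
+# C1: entities with identical attribute sets are merge candidates
+duplicates = [
+    (a, b) for a, b in combinations(sorted(context), 2)
+    if context[a] == context[b]
+]
+
+# C2: {domain, VSM system} combinations that no entity covers
+gaps = [
+    (d, s) for d in ["Production", "Exchange"] for s in ["S1", "S2"]
+    if not extent(frozenset({d, s}), context)
+]
+```
+
+With only two attributes per entity the duplicate check is coarse; adding
+abstraction level and key terms to the context would separate entities
+that merely share a domain and VSM system.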
+
+### 3.6 DSL Design Principles
+
+**Origin:** Mernik et al. (2005) "When and How to Develop DSLs";
+Karsai et al. (2014) "Design Guidelines for Domain-Specific Languages".
+
+**What they define:** Quality criteria for a set of concepts that form a
+language for a specific domain:
+
+- **Soundness**: Every concept in the language corresponds to a real domain
+  concern (no invented abstractions).
+- **Completeness**: The language can express everything needed for its
+  intended tasks.
+- **Laconicity**: No unnecessary concepts — every concept earns its place.
+- **Orthogonality**: Concepts are independent; combining any two produces
+  a meaningful result (no redundant combinations).
+
+**How we use it:** Our entity set is effectively a domain-specific
+vocabulary for "explaining classical economics through VSM". DSL quality
+criteria translate directly:
+
+- **Soundness** → Validity (SEQUAL): every entity grounded in Smith's text
+- **Completeness** → Coverage (C2): can we answer the "competency
+  questions" the infospace is meant to address?
+- **Laconicity** → Anti-redundancy (C1) + Indispensability (C5): would
+  removing any entity lose explanatory power?
+- **Orthogonality** → Non-overlap (C1): entity definitions don't
+  substantially duplicate each other
+
+**Adaptation:** We operationalise DSL completeness through **competency
+questions** — a set of canonical questions the infospace should be able to
+answer (e.g. "How does the division of labour relate to market extent?",
+"What mechanisms regulate wages toward their natural rate?"). LLM-Eval
+tests whether the current entity set suffices to answer each question.
+Unanswerable questions identify specific completeness gaps.
+
+Laconicity is operationalised as **indispensability scoring**: for each
+entity, LLM-Eval rates whether removing it would lose explanatory power.
+Low-scoring entities are candidates for merging or retirement.
+
+---
+
+## 4. Integration: Metric Definitions by Concern
+
+### C1: Semantic Overlap / Redundancy
+
+**Goal:** Identify entities that substantially overlap in meaning and
+should be merged, distinguished, or retired.
+
+**Metrics:**
+
+| Metric | Type | Computation |
+|--------|------|-------------|
+| `similarity_matrix` | Deterministic | Embed all entity definitions; compute NxN cosine similarity |
+| `high_similarity_pairs` | Deterministic | Pairs with cosine > 0.80, sorted descending |
+| `confirmed_synonyms` | LLM-Eval | For each high-similarity pair, LLM judges: "same concept" / "genuinely distinct" / "partial overlap" |
+| `redundancy_ratio` | Deterministic | `len(confirmed_synonyms) / total_entities` (pairs judged "same concept", relative to entity count) |
+| `intensional_conciseness` | Deterministic | `1 - redundancy_ratio` (from KG quality framework) |
+
+**Pipeline:**
+1. Embed definitions (embedding API or local model)
+2. Compute cosine similarity matrix
+3. Filter pairs above threshold
+4. LLM pairwise judgment on filtered pairs only (avoids N² LLM calls)
+5. Aggregate into ratio and conciseness score
+
+**Output:** `output/metrics/redundancy-report.md` + structured YAML with
+pair list, scores, and merge/retire recommendations.
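+
+The pipeline above can be sketched as follows. This is an illustration
+under assumptions: the three-dimensional vectors stand in for real
+embeddings, the `verdicts` dict stands in for the LLM-Eval step, and the
+slug `ordinary-price` is hypothetical:
+
+```python
+import math
+from itertools import combinations
+
+
+def cosine(u, v):
+    """Cosine similarity of two equal-length vectors."""
+    dot = sum(a * b for a, b in zip(u, v))
+    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
+    return dot / norm if norm else 0.0
+
+
+def high_similarity_pairs(embeddings, threshold=0.80):
+    """Return (slug_a, slug_b, score) above the threshold, best first."""
+    scored = [
+        (a, b, cosine(embeddings[a], embeddings[b]))
+        for a, b in combinations(sorted(embeddings), 2)
+    ]
+    return sorted((p for p in scored if p[2] > threshold), key=lambda p: -p[2])
+
+
+# Toy vectors standing in for definition embeddings
+embeddings = {
+    "natural-price": [0.9, 0.1, 0.0],
+    "ordinary-price": [0.88, 0.15, 0.02],
+    "division-of-labour": [0.0, 0.2, 0.95],
+}
+candidates = high_similarity_pairs(embeddings)
+
+# Placeholder for LLM-Eval verdicts on the filtered pairs only
+verdicts = {("natural-price", "ordinary-price"): "same concept"}
+confirmed = [p for p in candidates if verdicts.get(p[:2]) == "same concept"]
+
+redundancy_ratio = len(confirmed) / len(embeddings)
+intensional_conciseness = 1 - redundancy_ratio
+```
+
+Filtering before the LLM step is what keeps the number of pairwise
+judgments proportional to the number of suspicious pairs rather than to
+N².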
+
+### C2: Coverage Completeness
+
+**Goal:** Identify domain areas and VSM systems that lack adequate
+representation in the entity set.
+
+**Metrics:**
+
+| Metric | Type | Computation |
+|--------|------|-------------|
+| `domain_vsm_matrix` | Deterministic | Count entities per {economic_domain, VSM_system} cell |
+| `coverage_ratio` | Deterministic | `populated_cells / expected_cells` |
+| `vsm_balance_entropy` | Deterministic | Shannon entropy of entity distribution across VSM systems (higher = more balanced) |
+| `empty_cells` | Deterministic | List of {domain, VSM_system} pairs with zero entities |
+| `competency_coverage` | LLM-Eval | For each competency question, can it be answered with current entities? |
+| `fca_gap_concepts` | Deterministic | Attribute combinations in the FCA lattice with no corresponding entity |
+
+**Pipeline:**
+1. Parse entity metadata (domain, VSM mapping) from files on disk
+2. Build domain × VSM matrix; identify empty cells
+3. Build FCA formal context; compute lattice; extract gap concepts
+4. Define competency questions (initially hand-written, later LLM-generated
+   from the source material)
+5. LLM-evaluate answerability of each question
+6. Aggregate into coverage ratio, entropy, and gap list
+
+**Output:** `output/metrics/coverage-report.md` + YAML with matrix, gaps,
+and competency question results.
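+
+A sketch of the deterministic metrics, assuming that every
+{domain, VSM system} cell counts as expected and that entity metadata is
+available as plain dicts. The field names and sample entities are
+assumptions for illustration:
+
+```python
+import math
+from collections import Counter
+
+
+def coverage_metrics(entities, domains, systems):
+    """Return the cell counts, coverage ratio, empty cells, and VSM entropy."""
+    matrix = Counter((e["domain"], e["vsm"]) for e in entities)
+    expected = [(d, s) for d in domains for s in systems]
+    empty_cells = [cell for cell in expected if matrix[cell] == 0]
+    coverage_ratio = (len(expected) - len(empty_cells)) / len(expected)
+
+    # Shannon entropy (bits) of the entity distribution across VSM systems
+    per_system = Counter(e["vsm"] for e in entities)
+    total = sum(per_system.values())
+    entropy = -sum(
+        (n / total) * math.log2(n / total) for n in per_system.values()
+    )
+    return matrix, coverage_ratio, empty_cells, entropy
+
+
+# Hypothetical metadata index
+entities = [
+    {"slug": "division-of-labour", "domain": "Production", "vsm": "S1"},
+    {"slug": "market-price", "domain": "Exchange", "vsm": "S2"},
+    {"slug": "natural-price", "domain": "Exchange", "vsm": "S3"},
+    {"slug": "wages", "domain": "Distribution", "vsm": "S1"},
+]
+matrix, ratio, empty, entropy = coverage_metrics(
+    entities,
+    domains=["Production", "Exchange", "Distribution"],
+    systems=["S1", "S2", "S3"],
+)
+```
+
+If some cells are not meaningful for the topic, passing a narrower list
+of expected cells would keep `coverage_ratio` from penalising them.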
+
+### C3: Structural Coherence
+
+**Goal:** Determine whether the entities form a connected explanatory web
+or a fragmented collection of isolated concepts.
+
+**Metrics:**
+
+| Metric | Type | Computation |
+|--------|------|-------------|
+| `relationship_graph` | LLM-Eval + Deterministic | Infer edges from definition cross-references (string matching) + LLM judgment for implicit references |
+| `connected_components` | Deterministic | Number of connected components in the graph (target: 1) |
+| `graph_density` | Deterministic | `actual_edges / possible_edges` |
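+| `avg_degree` | Deterministic | `total_edges / total_entities` (mean out-degree of the directed graph; `2 × total_edges / total_entities` if edges are treated as undirected) |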
+| `relationship_richness` | Deterministic | OntoQA RR: `non_hierarchical_edges / total_edges` |
+| `modularity` | Deterministic | Louvain modularity score (0.3-0.7 = meaningful structure; >0.8 = fragmentation) |
+| `bridge_concepts` | Deterministic | Entities with highest betweenness centrality (connect clusters) |
+| `orphan_entities` | Deterministic | Entities with degree 0 or 1 |
+| `cohesion_by_domain` | Deterministic | Avg intra-domain edges per entity |
+| `coupling_across_domains` | Deterministic | Inter-domain edges / total edges |
+
+**Pipeline:**
+1. Extract explicit cross-references from definitions (entity name
+   mentions in other definitions — string matching with slug normalisation)
+2. For entity pairs not caught by string matching, LLM-Eval: "Does A's
+   definition depend on or reference B's concept?"
+3. Build directed graph
+4. Compute graph metrics (networkx or equivalent)
+5. Run community detection; compare detected communities to declared
+   economic domains
+
+**Output:** `output/metrics/coherence-report.md` + YAML with graph
+statistics, orphan list, bridge concepts, and community structure.
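+
+A pure-Python sketch of three of the simpler metrics, assuming edges are
+directed `(source, target)` pairs and that components are counted with
+direction ignored. Centrality, modularity, and community detection are
+left to a graph library and not shown; the sample graph is invented:
+
+```python
+from collections import defaultdict
+
+
+def coherence_metrics(nodes, edges):
+    """Components (direction ignored), directed density, and orphans."""
+    neighbours = defaultdict(set)
+    for src, dst in edges:
+        neighbours[src].add(dst)
+        neighbours[dst].add(src)
+
+    # Count connected components with an iterative depth-first search
+    seen, components = set(), 0
+    for node in nodes:
+        if node in seen:
+            continue
+        components += 1
+        stack = [node]
+        while stack:
+            current = stack.pop()
+            if current in seen:
+                continue
+            seen.add(current)
+            stack.extend(neighbours[current] - seen)
+
+    n = len(nodes)
+    density = len(edges) / (n * (n - 1)) if n > 1 else 0.0
+    orphans = sorted(x for x in nodes if len(neighbours[x]) <= 1)
+    return {
+        "connected_components": components,
+        "graph_density": density,
+        "orphan_entities": orphans,
+    }
+
+
+nodes = ["division-of-labour", "market-extent", "natural-price",
+         "market-price", "wages", "rent"]
+edges = [
+    ("division-of-labour", "market-extent"),
+    ("market-price", "natural-price"),
+    ("natural-price", "wages"),
+    ("wages", "market-price"),
+]
+metrics = coherence_metrics(nodes, edges)
+```
+
+Here `rent` has no edges at all, while the first two entities only
+reference each other, so all three are reported as orphans under the
+"degree 0 or 1" rule.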
+
+### C4: Definitional Consistency
+
+**Goal:** Ensure entities are defined consistently, non-circularly, and
+without contradicting each other.
+
+**Metrics:**
+
+| Metric | Type | Computation |
+|--------|------|-------------|
+| `definitional_dependency_graph` | Deterministic + LLM-Eval | Edges where A's definition uses B's concept |
+| `circular_definitions` | Deterministic | Cycles of length ≤ 3 in the dependency graph |
+| `definition_depth` | Deterministic | Longest dependency chain per entity before reaching a term not in the entity set |
+| `undefined_dependencies` | Deterministic | Terms used in definitions that arguably should be entities but aren't |
+| `pairwise_consistency` | LLM-Eval | For related entity pairs (sharing edges): "Do these definitions contradict each other?" |
+| `source_fidelity` | LLM-Eval | "Does this definition accurately represent what Smith wrote in the cited passage?" |
+| `metaproperty_violations` | LLM-Eval + Deterministic | OntoClean constraint checking after LLM classifies rigidity/identity |
+| `grounding_ratio` | Deterministic | Fraction of entities traceable to primitives without cycles |
+
+**Pipeline:**
+1. Build definitional dependency graph (same technique as C3; edge
+   direction matters here: A → B means A's definition uses B, not vice versa)
+2. Detect cycles; flag short cycles
+3. Extract undefined terms (terms matching entity-name patterns that appear
+   in definitions but have no corresponding entity file)
+4. LLM pairwise consistency check on directly-connected pairs
+5. LLM source fidelity check (compare definition to source chapter text)
+6. LLM OntoClean metaproperty classification; deterministic constraint
+   checking
+
+**Output:** `output/metrics/consistency-report.md` + YAML with cycle list,
+undefined terms, contradiction candidates, and metaproperty violations.
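+
+Cycle detection on the dependency graph can be sketched as below,
+assuming the graph is a dict mapping each entity to the entities its
+definition uses. The dependencies shown are invented to exercise the
+code, including one self-reference (OOPS! P24):
+
+```python
+def short_cycles(dependencies, max_length=3):
+    """Find directed cycles of at most max_length nodes, each reported once."""
+    found = set()
+
+    def walk(start, node, path):
+        for nxt in dependencies.get(node, ()):
+            if nxt == start:
+                # Rotate so the smallest slug leads; direction is preserved
+                pivot = path.index(min(path))
+                found.add(tuple(path[pivot:] + path[:pivot]))
+            elif nxt not in path and len(path) < max_length:
+                walk(start, nxt, path + [nxt])
+
+    for start in dependencies:
+        walk(start, start, [start])
+    return sorted(found)
+
+
+# Hypothetical definitional dependencies: entity -> entities it uses
+dependencies = {
+    "value": ["labour"],
+    "labour": ["wages"],
+    "wages": ["value"],
+    "natural-price": ["natural-price"],
+    "rent": ["land"],
+}
+cycles = short_cycles(dependencies)
+```
+
+Note that `land` appears as a dependency without an entry of its own,
+which is the kind of term `undefined_dependencies` is meant to surface.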
+
+### C5: Granularity Balance
+
+**Goal:** Ensure entities operate at comparable levels of abstraction
+within their respective domains and perspectives.
+
+**Metrics:**
+
+| Metric | Type | Computation |
+|--------|------|-------------|
+| `abstraction_classification` | LLM-Eval | Classify each entity as theory-level / mechanism-level / observation-level |
+| `scope_score` | LLM-Eval | Rate each entity 1-5 for generality (1 = very specific instance, 5 = broad theoretical principle) |
+| `abstraction_distribution` | Deterministic | Count per level; compute entropy |
+| `scope_variance` | Deterministic | Variance of scope scores within each domain |
+| `level_x_perspective_matrix` | Deterministic | Cross-tabulation of abstraction level × economic domain |
+| `indispensability` | LLM-Eval | "If removed, what explanatory power is lost?" (1-5) |
+| `dispensable_entities` | Deterministic | Entities with indispensability score ≤ 2 |
+| `merge_candidates` | LLM-Eval | Pairs where one is a sub-case of the other |
+
+**Pipeline:**
+1. LLM-classify each entity: abstraction level, scope score,
+   indispensability
+2. Build level × perspective matrix
+3. Compute distribution entropy and per-domain scope variance
+4. Flag outliers: entities whose scope score deviates > 1.5σ from their
+   domain mean
+5. For outlier entities, LLM-Eval: "Should this be merged into a broader
+   concept, or split into sub-concepts?"
+
+**Output:** `output/metrics/granularity-report.md` + YAML with
+classifications, distribution, outliers, and merge/split recommendations.
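+
+Step 4 can be sketched as follows, assuming scope scores have already
+been assigned by LLM-Eval and using the population standard deviation
+(whether population or sample deviation is intended is not stated above).
+All slugs and scores are placeholders:
+
+```python
+from collections import defaultdict
+from statistics import mean, pstdev
+
+
+def scope_outliers(entities, sigma=1.5):
+    """Slugs whose scope score deviates more than sigma * sd from the domain mean."""
+    by_domain = defaultdict(list)
+    for entity in entities:
+        by_domain[entity["domain"]].append(entity)
+
+    outliers = []
+    for members in by_domain.values():
+        scores = [m["scope"] for m in members]
+        mu, sd = mean(scores), pstdev(scores)
+        if sd == 0:
+            continue  # every entity in the domain has the same score
+        outliers.extend(
+            m["slug"] for m in members if abs(m["scope"] - mu) > sigma * sd
+        )
+    return outliers
+
+
+# Placeholder scores: five mid-level concepts and one very specific instance
+entities = [
+    {"slug": f"production-concept-{i}", "domain": "Production", "scope": 3}
+    for i in range(5)
+]
+entities.append({"slug": "pin-factory", "domain": "Production", "scope": 1})
+entities += [
+    {"slug": "market-price", "domain": "Exchange", "scope": 4},
+    {"slug": "natural-price", "domain": "Exchange", "scope": 4},
+]
+outliers = scope_outliers(entities)
+```
+
+Domains with very few entities give unstable deviations, so a minimum
+domain size before flagging may be worth considering.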
+
+---
+
+## 5. Shared Infrastructure
+
+Several concerns share underlying computations:
+
+| Infrastructure | Used by | Built from |
+|---------------|---------|------------|
+| Definition embeddings (vector per entity) | C1, C3 | Embedding API call per entity |
+| Relationship graph (entity → entity edges) | C3, C4 | String matching + LLM-Eval |
+| FCA formal context (entity × attribute matrix) | C2, C5 | Metadata parsing + LLM classification |
+| Entity metadata index (domain, VSM, chapter, sections) | C2, C5, schema compliance (INFRA-TASKS #10) | Deterministic markdown parsing |
+
+These should be computed once per evaluation run and cached for use by
+all concern-specific metrics.
+
+---
+
+## 6. Evaluation Workflow
+
+A full collection-level evaluation run:
+
+```
+process_chapters.py --evaluate-collection --provider <provider>
+```
+
+1. **Parse** — deterministic metadata extraction from all entity files
+2. **Embed** — compute definition embeddings (cached; only new/changed
+   entities need fresh embeddings)
+3. **Infer** — LLM-Eval for relationship edges, metaproperties,
+   abstraction levels, pairwise judgments (batched to minimise LLM calls)
+4. **Compute** — deterministic graph metrics, FCA lattice, coverage
+   matrix, similarity matrix, cycle detection
+5. **Aggregate** — combine per-entity and per-pair scores into
+   collection-level metrics
+6. **Report** — write per-concern markdown reports + unified `metrics.yaml`
+7. **Append** — add timestamped snapshot to `metrics-history.yaml`
+
+Incremental mode (`--evaluate-collection --chapter <id>`) re-evaluates
+only the entities introduced or modified by that chapter, plus any
+pairwise checks involving those entities.
+
+---
+
+## 7. References
+
+- Lindland, O.I., Sindre, G. & Sølvberg, A. (1994). "Understanding
+  Quality in Conceptual Modeling." *IEEE Software* 11(2), 42-49.
+  → SEQUAL framework: validity and completeness dimensions.
+
+- Guarino, N. & Welty, C.A. (2004). "An Overview of OntoClean." In
+  *Handbook on Ontologies*, Springer, 151-171.
+  → Metaproperty analysis: rigidity, identity, unity, dependence.
+
+- Poveda-Villalón, M., Gómez-Pérez, A. & Suárez-Figueroa, M.C. (2014).
+  "OOPS! (OntOlogy Pitfall Scanner!): An On-line Tool for Ontology
+  Evaluation." *IJSWIS* 10(2), 7-34.
+  → Pitfall catalogue: 41 anti-patterns for ontology design.
+
+- Tartir, S. & Arpinar, I.B. (2007). "Ontology Evaluation and Ranking
+  using OntoQA." *ICSC 2007*, IEEE, 185-192.
+  → Schema metrics: relationship richness, attribute richness.
+
+- Wille, R. (1982). "Restructuring Lattice Theory." In *Ordered Sets*,
+  Reidel, 445-470.
+  → Formal Concept Analysis: concept lattices from binary contexts.
+
+- Elhaj, H. et al. (2008). "Auditing SNOMED CT with Formal Concept
+  Analysis." *AMIA Annual Symposium*, PMC2605587.
+  → FCA for ontology completeness auditing.
+
+- Keet, C.M. (2008). *A Formal Theory of Granularity.* PhD thesis,
+  Free University of Bozen-Bolzano.
+  → Granularity levels and perspectives for ontology design.
+
+- Mernik, M., Heering, J. & Sloane, A.M. (2005). "When and How to
+  Develop Domain-Specific Languages." *ACM Computing Surveys* 37(4),
+  316-344.
+  → DSL design: soundness, completeness, laconicity.
+
+- Karsai, G. et al. (2014). "Design Guidelines for Domain Specific
+  Languages." *arXiv:1409.2378*.
+  → Orthogonality, necessary-and-sufficient principle.
+
+- Xue, B. & Zou, L. (2022). "Knowledge Graph Quality Management: A
+  Comprehensive Survey." *IEEE TKDE* 35(5), 4969-4988.
+  → KG quality dimensions: conciseness, consistency, completeness.
--- a/roadmap/infospace-tooling/PLAN.md
+++ b/roadmap/infospace-tooling/PLAN.md
@@ -0,0 +1,621 @@
+# Viable Infospace Tooling — Roadmap
+
+## Vision
+
+An **infospace** is a structured, evaluable, composable collection of
+concepts that explains a **topic** through the lens of one or more
+**disciplines**. Infospaces are the unit of knowledge work in MarkiTect.
+
+This roadmap organises the work needed to move from the current
+ad-hoc example (`infospace-with-history`) to a general-purpose platform
+for creating, evaluating, maintaining, and composing infospaces.
+
+---
+
+## Terminology
+
+These terms establish the vocabulary for infospace tooling. They
+generalise from the Wealth of Nations / VSM example but are not
+specific to it.
+
+### Infospace
+
+A curated, self-describing collection of **entities** (concepts,
+mechanisms, observations) that together explain a **topic**. An
+infospace has:
+
+- A **topic** — the subject matter being explained (e.g. "The Wealth
+  of Nations", "cellular biology", "Kubernetes networking")
+- One or more **disciplines** — external frameworks applied as lenses
+  (e.g. "Viable System Model", "category theory")
+- **Entities** — the atomic units of knowledge, each with a definition,
+  provenance, and quality scores
+- **Schemas** — structural templates that define what a well-formed
+  entity, mapping, or analysis looks like
+- **Evaluations** — per-entity and collection-level quality assessments
+- **Metrics** — quantitative indicators of completeness, coherence,
+  consistency, and granularity balance
+
+An infospace is **viable** when it meets threshold scores across its
+defined metrics — it is fit for purpose as an explanatory tool.
+
+### Topic
+
+The subject matter an infospace is built to explain. A topic sits
+within a **domain** (broader field of knowledge) but is more specific:
+
+- Domain: Economics → Topic: The Wealth of Nations
+- Domain: Systems Theory → Topic: Viable System Model
+- Domain: Computer Science → Topic: Distributed consensus protocols
+
+A topic provides the **source material** — the texts, data, or
+observations from which entities are extracted.
+
+### Discipline
+
+A reusable framework of concepts applied as a lens to explore a topic.
+A discipline is itself an infospace — one that has been evaluated as
+viable and packaged for reuse.
+
+In our example, the VSM is the discipline: a set of concepts (S1-S5,
+recursion, variety, viability) from systems theory, applied to the
+economic concepts in Smith's work.
+
+**Key property:** Disciplines compose. An infospace built with one
+discipline can itself become a discipline for another infospace. The
+Wealth of Nations infospace, viewed through VSM, could become a
+discipline applied to a modern supply chain analysis.
+
+### Entity
+
+The atomic unit of an infospace. An entity has:
+
+- **Identity**: a unique slug and human-readable title
+- **Definition**: a precise, non-circular explanation
+- **Provenance**: the source chapter, passage, and extraction context
+- **Domain placement**: which area of the topic it belongs to
+- **Discipline mapping**: how it connects to the applied discipline
+  (e.g. which VSM system)
+- **Quality scores**: per-entity LLM-evaluated metrics
+- **Lifecycle state**: active, archived (with reason), or draft
+
+### Evaluation
+
+A structured assessment of quality, applied at two levels:
+
+- **Per-entity evaluation**: scores an individual entity against
+  quality rubrics defined in its schema (definition precision, source
+  grounding, discipline relevance, etc.)
+- **Collection evaluation**: scores the entity set as a whole against
+  five concerns: redundancy, coverage, coherence, consistency, and
+  granularity balance
+
+Evaluations are always performed by **delegated LLM calls** through
+MarkiTect's LLM integration — never by the coding agent working on
+infrastructure. This separation ensures that domain-level judgment
+stays in the problem space, not the tooling space.
+
+### Viability
+
+An infospace is viable when:
+
+1. Its entities individually meet quality thresholds (per-entity eval)
+2. Its collection metrics are within acceptable ranges
+3. It can answer its defined **competency questions** — the canonical
+   queries the infospace is meant to support
+4. It has been evaluated recently enough that metrics reflect current
+   content
+
+Viability is not binary: it is a profile of scores, with thresholds that
+the user sets based on their needs.
+
+---
+
+## Architecture: Three Layers
+
+```
+┌──────────────────────────────────────────────────┐
+│  Layer 3: Infospace Instances                    │
+│  Specific infospaces built by users              │
+│  (Wealth of Nations + VSM, supply chain + ...)   │
+│  Works IN an infospace                           │
+├──────────────────────────────────────────────────┤
+│  Layer 2: Infospace Tooling                      │
+│  Terminology, primitives, composition model      │
+│  CLI: infospace create/evaluate/compose/...      │
+│  Works WITH infospaces                           │
+├──────────────────────────────────────────────────┤
+│  Layer 1: MarkiTect Platform                     │
+│  Artifacts, prompts, LLM, spaces, graph, embed   │
+│  Provides FOR infospaces                         │
+└──────────────────────────────────────────────────┘
+```
+
+### Boundary condition: LLM delegation
+
+All LLM-based evaluation (entity scoring, pairwise judgments, coverage
+analysis) is delegated to MarkiTect's LLM integration module. The coding
+agent that works on infrastructure never makes domain-level judgments
+itself. This keeps a clean separation:
+
+- **Coding agent** → writes Python, templates, schemas, tests
+- **MarkiTect LLM** → evaluates entities, judges redundancy, assesses
+  coverage, checks consistency
+
+The infospace tooling (Layer 2) orchestrates these LLM calls through
+prompt templates and the prompt execution engine, not through ad-hoc
+prompting.
+
+---
+
+## Stage 1: MarkiTect Platform Additions
+
+Infrastructure that must exist before infospace tooling can be built.
+These are general-purpose platform capabilities, not infospace-specific.
+
+### S1.1 — Entity metadata parser
+
+Add a deterministic markdown parser that extracts structured metadata
+from entity files: H1 title, sections present, word counts, domain,
+source chapter. Returns a dataclass usable by all downstream metrics.
+
+**Maps to:** INFRA-TASKS #13, #10
+**Location:** `markitect/prompts/quality/` or new `markitect/analysis/`
+**Depends on:** Nothing — can start immediately
+**Deliverable:** `parse_entity_metadata(path) -> EntityMeta` function
+with tests
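+
+One possible shape for this deliverable, as a sketch only. The
+`EntityMeta` fields and the `parse_entity_text` helper are assumptions;
+domain and source-chapter extraction are omitted because their location
+in the entity file is not specified here:
+
+```python
+import re
+from dataclasses import dataclass, field
+from pathlib import Path
+
+
+@dataclass
+class EntityMeta:
+    title: str = ""
+    sections: list = field(default_factory=list)
+    word_counts: dict = field(default_factory=dict)
+
+
+def parse_entity_text(text: str) -> EntityMeta:
+    """Extract the H1 title, H2 section names, and words per section."""
+    meta = EntityMeta()
+    current = None
+    for line in text.splitlines():
+        if m := re.match(r"^# (.+)$", line):
+            meta.title = m.group(1).strip()
+        elif m := re.match(r"^## (.+)$", line):
+            current = m.group(1).strip()
+            meta.sections.append(current)
+            meta.word_counts[current] = 0
+        elif current is not None:
+            meta.word_counts[current] += len(line.split())
+    return meta
+
+
+def parse_entity_metadata(path) -> EntityMeta:
+    """Read an entity file from disk and parse it."""
+    return parse_entity_text(Path(path).read_text(encoding="utf-8"))
+
+
+sample = "\n".join([
+    "# Division of Labour",
+    "",
+    "## Definition",
+    "Splitting work into specialised tasks.",
+    "",
+    "## Original Wording",
+    "The greatest improvement in the productive powers of labour.",
+])
+meta = parse_entity_text(sample)
+```
+
+Splitting text parsing from file reading keeps the parser testable
+without touching the filesystem.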
+
+### S1.2 — Schema compliance validator
+
+Deterministic validation of entity/mapping files against their schemas:
+section presence, word count ranges, heading format, enum values. No
+LLM needed.
+
+**Maps to:** INFRA-TASKS #10
+**Location:** `markitect/prompts/quality/validator.py` (extend existing)
+**Depends on:** S1.1
+**Deliverable:** `validate_document(path, schema) -> ValidationResult`
+with tests
+
+### S1.3 — Embedding adapter
+
+Add embedding support to `markitect/llm/`. Needs:
+
+- `EmbeddingAdapter` interface: `embed(texts: list[str]) -> list[list[float]]`
+- `OpenRouterEmbeddingAdapter` implementation (or OpenAI embedding endpoint)
+- Caching layer: store embeddings keyed by `{slug: content_digest}` so
+  unchanged entities skip re-embedding
+- Cosine similarity utility: `similarity_matrix(embeddings) -> np.ndarray`
+
+**Maps to:** INFRA-TASKS #14 (prerequisite)
+**Location:** `markitect/llm/embeddings.py`
+**Depends on:** Nothing — can start immediately
+**Deliverable:** Embedding adapter + cache + similarity computation, with
+tests
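+
+The caching behaviour could look roughly like this. It is a sketch under
+assumptions: `CachedEmbedder` and `fake_embed` are hypothetical names, the
+cache lives in memory only, and the digest is SHA-256 of the definition
+text:
+
+```python
+import hashlib
+
+
+def content_digest(text: str) -> str:
+    """Stable digest used to detect changed definitions."""
+    return hashlib.sha256(text.encode("utf-8")).hexdigest()
+
+
+class CachedEmbedder:
+    """Wraps an embed function and skips entities whose text is unchanged."""
+
+    def __init__(self, embed):
+        self._embed = embed  # callable: list[str] -> list[list[float]]
+        self._cache = {}     # slug -> (digest, vector)
+
+    def embed_entities(self, definitions):
+        stale = [
+            slug for slug, text in definitions.items()
+            if self._cache.get(slug, ("", None))[0] != content_digest(text)
+        ]
+        if stale:
+            vectors = self._embed([definitions[s] for s in stale])
+            for slug, vector in zip(stale, vectors):
+                self._cache[slug] = (content_digest(definitions[slug]), vector)
+        return {slug: self._cache[slug][1] for slug in definitions}
+
+
+calls = []
+
+
+def fake_embed(texts):
+    """Stand-in for a real embedding endpoint; records the batch size."""
+    calls.append(len(texts))
+    return [[float(len(t)), 1.0] for t in texts]
+
+
+embedder = CachedEmbedder(fake_embed)
+embedder.embed_entities({"wages": "Price of labour.", "rent": "Price of land."})
+embedder.embed_entities({"wages": "Price of labour.", "rent": "Payment for land."})
+```
+
+On the second call only `rent` is sent for embedding, since the digest of
+`wages` is unchanged.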
+
+### S1.4 — Graph analysis utilities
+
+The existing `DependencyGraph` supports basic traversal and cycle
+detection. Collection-level metrics need richer analysis:
+
+- Connected components
+- Betweenness centrality
+- Community detection (Louvain or label propagation)
+- Modularity score
+- Degree distribution
+- Cohesion/coupling computation
+
+Decide: extend `DependencyGraph` or add a lightweight wrapper that
+converts to networkx (adding it as an optional dependency).
+
+**Maps to:** INFRA-TASKS #16 (prerequisite)
+**Location:** `markitect/prompts/dependencies/analysis.py` or new
+`markitect/analysis/graph.py`
+**Depends on:** Nothing — can start immediately
+**Deliverable:** Graph analysis functions with tests
+
+### S1.5 — Structured evaluation output
+
+Define a standard format for evaluation results: YAML front-matter +
+markdown body. Add utilities for:
+
+- Writing evaluation results (per-entity, per-pair, collection-level)
+- Reading/parsing evaluation results back into dataclasses
+- Appending timestamped snapshots to a history file
+- Diffing two snapshots
+
+**Maps to:** INFRA-TASKS #11, #12
+**Location:** `markitect/prompts/quality/` or `markitect/analysis/`
+**Depends on:** S1.1
+**Deliverable:** `EvaluationResult` model + read/write utilities with
+tests
+
+### S1.6 — Batch LLM evaluation orchestrator
+
+A pipeline component that runs an evaluation prompt template against a
+batch of entities (or entity pairs), collecting structured results.
+Must handle:
+
+- Rate limiting and retry (reuse existing adapter logic)
+- Progress reporting
+- Incremental evaluation (skip entities whose content hasn't changed
+  since last eval)
+- Result aggregation
+
+This is the mechanism by which infospace tooling delegates LLM work
+to the platform.
+
+**Maps to:** INFRA-TASKS #9 (prerequisite)
+**Location:** `markitect/prompts/execution/batch.py`
+**Depends on:** S1.5
+**Deliverable:** `BatchEvaluator` class with tests
+
+### S1.7 — FCA computation
+
+Formal Concept Analysis: build a formal context (entity × attribute
+matrix), compute the concept lattice, extract gap concepts. Either
+implement a minimal FCA algorithm or integrate a library.
+
+**Maps to:** INFRA-TASKS #15 (prerequisite)
+**Location:** `markitect/analysis/fca.py`
+**Depends on:** S1.1
+**Deliverable:** `FormalContext`, `ConceptLattice`, `find_gap_concepts()`
+with tests
+
+### Summary: Stage 1 dependency graph
+
+```
+S1.1 Entity metadata parser ──┬── S1.2 Schema validator
+                               ├── S1.5 Eval output format ── S1.6 Batch evaluator
+                               └── S1.7 FCA computation
+
+S1.3 Embedding adapter ──────── (independent)
+S1.4 Graph analysis ─────────── (independent)
+```
+
+S1.1, S1.3, and S1.4 can proceed in parallel. S1.6 (batch evaluator) sits
+at the end of the longest chain (S1.1 → S1.5 → S1.6) and is the final
+piece needed before the Stage 2 evaluation work (S2.3, S2.4) can begin.
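+
+The same dependencies can be checked mechanically. A small sketch using
+the standard-library `graphlib` module (Python 3.9+), with the mapping
+transcribed from the **Depends on** lines above:
+
+```python
+from graphlib import TopologicalSorter
+
+# task -> set of tasks it depends on
+deps = {
+    "S1.1": set(),
+    "S1.2": {"S1.1"},
+    "S1.3": set(),
+    "S1.4": set(),
+    "S1.5": {"S1.1"},
+    "S1.6": {"S1.5"},
+    "S1.7": {"S1.1"},
+}
+
+sorter = TopologicalSorter(deps)
+sorter.prepare()
+
+# Group tasks into waves that can run in parallel
+waves = []
+while sorter.is_active():
+    ready = sorted(sorter.get_ready())
+    waves.append(ready)
+    sorter.done(*ready)
+```
+
+Each wave lists tasks whose prerequisites are all complete, so the number
+of waves gives the minimum number of sequential steps for Stage 1.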
+
+---
+
+## Stage 2: Infospace Tooling
+
+The user-facing layer that provides documented primitives for working
+with infospaces. Built on top of Stage 1 infrastructure and the existing
+`markitect/spaces/` module.
+
+### S2.1 — Infospace model and configuration
+
+Define the `Infospace` as a first-class concept that extends the existing
+`InformationSpace` with:
+
+- **Topic declaration**: name, domain, source material reference
+- **Discipline bindings**: which external infospaces are applied as lenses
+- **Schema registry**: which schemas govern entity structure
+- **Competency questions**: what the infospace should be able to answer
+- **Viability thresholds**: minimum acceptable metric scores
+- **Evaluation state**: latest per-entity and collection scores
+
+Configuration format: an `infospace.yaml` (or section in existing config)
+that declares all of the above.
+
+**Location:** new `markitect/infospace/` package
+**Depends on:** S1.1, S1.5, existing `markitect/spaces/`
+**Deliverable:** `InfospaceConfig`, `InfospaceState` models + loader
+
+### S2.2 — Infospace lifecycle commands
+
+CLI commands for the core lifecycle:
+
+```bash
+# Initialise a new infospace
+markitect infospace init --topic "Wealth of Nations" \
+  --domain "Economics" \
+  --discipline vsm-framework
+
+# Show infospace status (entity count, eval state, viability)
+markitect infospace status
+
+# List entities with quality summary
+markitect infospace entities [--sort-by score|domain|chapter]
+
+# Show viability dashboard
+markitect infospace viability
+```
+
+These commands read the `infospace.yaml` config and present information
+from the metadata index and evaluation results.
+
+**Location:** `markitect/infospace/cli.py` integrated into main CLI
+**Depends on:** S2.1
+**Deliverable:** CLI commands with help text and tests
+
+### S2.3 — Per-entity evaluation primitives
+
+Prompt templates and CLI commands for evaluating individual entities:
+
+```bash
+# Evaluate all entities
+markitect infospace evaluate --provider openrouter
+
+# Evaluate entities from a specific chapter
+markitect infospace evaluate --chapter book-1-chapter-05 --provider openrouter
+
+# Re-evaluate a single entity
+markitect infospace evaluate --entity division-of-labour --provider openrouter
+```
+
+Uses the batch evaluator (S1.6) to run the evaluate-entity prompt
+template (defined in the infospace's schema directory) against entities.
+Writes structured results to `output/evaluations/`.
+
+**Maps to:** INFRA-TASKS #8, #9
+**Location:** `markitect/infospace/evaluation.py`
+**Depends on:** S1.6, S2.1
+**Deliverable:** Per-entity evaluation pipeline + CLI + prompt template
+
+### S2.4 — Collection-level checks
+
+CLI commands for each of the five collection concerns:
+
+```bash
+# Run all collection checks
+markitect infospace check --provider openrouter
+
+# Run specific checks
+markitect infospace check redundancy --provider openrouter
+markitect infospace check coverage --provider openrouter
+markitect infospace check coherence --provider openrouter
+markitect infospace check consistency --provider openrouter
+markitect infospace check granularity --provider openrouter
+```
+
+Each check uses Stage 1 infrastructure (embeddings, graph analysis, FCA)
+and delegates LLM judgment to the platform. Results written to
+`output/metrics/` as per-concern reports + unified `metrics.yaml`.
+
+**Maps to:** INFRA-TASKS #14-19
+**Location:** `markitect/infospace/checks/` (one module per concern)
+**Depends on:** S1.3, S1.4, S1.6, S1.7, S2.1
+**Deliverable:** Five check modules + unified orchestrator + CLI
+
+### S2.5 — Metrics history and viability tracking
+
+Track metrics over time. After each evaluation or check run, append a
+timestamped snapshot to `metrics-history.yaml`. Provide commands to
+review trends:
+
+```bash
+# Show metrics history
+markitect infospace history
+
+# Compare two snapshots
+markitect infospace history diff 2026-02-18 2026-03-01
+
+# Check viability against thresholds
+markitect infospace viability
+```
+
+Viability is assessed by comparing current metrics to the thresholds
+declared in `infospace.yaml`. The output is a simple pass/fail per metric,
+shown alongside the actual value.
+
+**Maps to:** INFRA-TASKS #12
+**Location:** `markitect/infospace/history.py`
+**Depends on:** S2.4, S1.5
+**Deliverable:** History tracking + viability assessment + CLI
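+
+A sketch of the threshold comparison, assuming thresholds are given as
+`min` and/or `max` bounds per metric and that a missing metric counts as
+a failure. The function name and the sample values are illustrative:
+
+```python
+def assess_viability(metrics, thresholds):
+    """Return {metric: (value, passed)} for every declared threshold."""
+    report = {}
+    for name, bounds in thresholds.items():
+        value = metrics.get(name)
+        if value is None:
+            report[name] = (None, False)  # not yet measured: treat as failing
+            continue
+        low = bounds.get("min", float("-inf"))
+        high = bounds.get("max", float("inf"))
+        report[name] = (value, low <= value <= high)
+    return report
+
+
+thresholds = {
+    "redundancy_ratio": {"max": 0.05},
+    "coverage_ratio": {"min": 0.60},
+    "consistency_cycles": {"max": 0},
+}
+# Hypothetical current metrics
+metrics = {
+    "redundancy_ratio": 0.02,
+    "coverage_ratio": 0.45,
+    "consistency_cycles": 0,
+}
+report = assess_viability(metrics, thresholds)
+viable = all(passed for _, passed in report.values())
+```
+
+Returning the value next to the pass/fail flag keeps the per-metric
+profile visible even when the overall result is a failure.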
+
+### S2.6 — Infospace composition model
+
+The mechanism by which one infospace is applied as a discipline to
+another. Builds on `markitect/spaces/composability/`:
+
+- **Discipline binding**: declare that infospace A uses infospace B as a
+  discipline. B's entities become available as mapping targets.
+- **Cross-infospace references**: entity in A maps to concept in B using
+  the same mapping schema and evaluation pipeline.
+- **Discipline viability requirement**: B must be viable (meets its own
+  thresholds) before it can be used as a discipline for A.
+- **Cascading evaluation**: when B's entities change, A's mappings that
+  reference them are flagged for re-evaluation.
+
+```bash
+# Bind a discipline to the current infospace
+markitect infospace bind-discipline ./path/to/vsm-infospace
+
+# List bound disciplines and their viability
+markitect infospace disciplines
+
+# Check for stale mappings after discipline update
+markitect infospace check stale-mappings
+```
+
+**Location:** `markitect/infospace/composition.py`
+**Depends on:** S2.1, existing `markitect/spaces/composability/`
+**Deliverable:** Composition model + CLI + documentation
+
+### S2.7 — Documentation: Infospace Primitives Reference
+
+A reference document explaining all primitives, their purpose, and how
+they compose. This is the user-facing documentation for the infospace
+tooling layer — the equivalent of a framework guide.
+
+**Location:** `docs/infospace-primitives.md` or in-CLI help
+**Depends on:** S2.1-S2.6
+**Deliverable:** Reference documentation
+
+### Summary: Stage 2 dependency graph
+
+```
+S2.1 Model & config ──┬── S2.2 Lifecycle CLI
+                       ├── S2.3 Per-entity evaluation
+                       ├── S2.4 Collection checks ── S2.5 History & viability
+                       └── S2.6 Composition model
+
+S2.7 Documentation (depends on all above)
+```
+
+---
+
+## Stage 3: Example Revision
+
+Revisit the Wealth of Nations / VSM example using the new tooling.
+The example becomes both a tutorial and a validation of the tooling.
+
+### S3.1 — Migrate example to infospace configuration
+
+Replace the ad-hoc `process_chapters.py` setup with a declarative
+`infospace.yaml`:
+
+```yaml
+topic:
+  name: "The Wealth of Nations"
+  domain: "Classical Economics"
+  sources: artifacts/sources/
+
+disciplines:
+  - name: "Viable System Model"
+    path: artifacts/vsm-reference/
+
+schemas:
+  entity: schemas/economic-entity-schema-v1.0.md
+  mapping: schemas/vsm-mapping-schema-v1.0.md
+  analysis: schemas/chapter-analysis-schema-v1.0.md
+
+competency_questions: schemas/competency-questions.md
+
+viability:
+  redundancy_ratio: { max: 0.05 }
+  coverage_ratio: { min: 0.60 }
+  coherence_components: { max: 1 }
+  consistency_cycles: { max: 0 }
+  granularity_entropy: { min: 1.0 }
+  per_entity_mean: { min: 3.5 }
+
+pipeline:
+  stages:
+    - template: extract-entities
+      spaces: [sources, guidelines, vsm-reference, entities]
+    - template: map-to-vsm
+      spaces: [entities, vsm-reference, guidelines]
+    - template: synthesize-analysis
+      spaces: [sources, entities, mappings, vsm-reference]
+  post_batch:
+    - template: assess-metrics
+      spaces: [analyses, vsm-reference]
+```
+
+**Depends on:** S2.1
+**Deliverable:** `infospace.yaml` + migration of `process_chapters.py` to
+use infospace tooling APIs
+
+### S3.2 — Clean per-chapter git history
+
+Re-run all processed chapters, and process the remaining ones, with
+per-chapter commits on a clean branch, then replace the current tangled
+history.
+
+**Maps to:** INFRA-TASKS #4, #7
+**Depends on:** S3.1
+**Deliverable:** Clean branch with one commit per chapter
+
+### S3.3 — Full evaluation run
+
+Run all per-entity evaluations and collection checks on the completed
+infospace. Establish baseline metrics. Demonstrate the viability
+dashboard.
+
+**Maps to:** INFRA-TASKS #6
+**Depends on:** S2.3, S2.4, S2.5, S3.2
+**Deliverable:** Complete evaluation results + viability report
+
+### S3.4 — Rewrite tutorial
+
+Update `TUTORIAL.md` to use infospace tooling commands instead of
+raw `process_chapters.py` invocations. The tutorial should walk
+through:
+
+1. Initialising an infospace (`markitect infospace init`)
+2. Defining schemas and competency questions
+3. Processing chapters (pipeline execution)
+4. Evaluating entities (`markitect infospace evaluate`)
+5. Running collection checks (`markitect infospace check`)
+6. Reviewing viability (`markitect infospace viability`)
+7. Iterating: refining guidelines, re-processing, re-evaluating
+8. Using the infospace as a discipline for a new project
+
+**Depends on:** S3.1-S3.3
+**Deliverable:** Revised `TUTORIAL.md`
+
+### S3.5 — Demonstrate composition
+
+Create a minimal second infospace (e.g. a modern supply chain case
+study or a different economic text) that binds the Wealth of Nations
+infospace as a discipline. This demonstrates the composition model from
+S2.6.
+
+**Depends on:** S2.6, S3.3
+**Deliverable:** Second example infospace + composition tutorial section
+
+---
+
+## Task Mapping
+
+Cross-reference between INFRA-TASKS numbers and roadmap stages:
+
+| INFRA-TASK | Description | Stage |
+|------------|-------------|-------|
+| 1-3 | Infra fixes (resolved) | — |
+| 4 | Per-chapter git history | S3.2 |
+| 5 | Prompt file side-effects | S1.6 (batch eval avoids this) |
+| 6 | Stale metrics | S3.3 |
+| 7 | Remaining 28 chapters | S3.2 |
+| 8 | Per-concept quality metrics in schema | S2.3 |
+| 9 | Evaluate-entity prompt template | S2.3 |
+| 10 | Deterministic schema compliance | S1.2 |
+| 11 | Structured metrics output | S1.5 |
+| 12 | Metrics-over-time tracking | S2.5 |
+| 13 | Entity metadata index | S1.1 |
+| 14 | Redundancy detection (C1) | S2.4 |
+| 15 | Coverage completeness (C2) | S2.4 |
+| 16 | Structural coherence (C3) | S2.4 |
+| 17 | Definitional consistency (C4) | S2.4 |
+| 18 | Granularity balance (C5) | S2.4 |
+| 19 | Unified collection evaluation | S2.4 |
+
+---
+
+## Implementation Order
+
+Recommended sequence, accounting for dependencies and value delivery:
+
+**Phase A — Foundation (Stage 1, parallelisable)**
+1. S1.1 Entity metadata parser
+2. S1.3 Embedding adapter
+3. S1.4 Graph analysis utilities
+
+**Phase B — Validation & Output (Stage 1)**
+4. S1.2 Schema compliance validator (needs S1.1)
+5. S1.5 Structured evaluation output (needs S1.1)
+6. S1.7 FCA computation (needs S1.1)
+
+**Phase C — Orchestration (Stage 1 → Stage 2 bridge)**
+7. S1.6 Batch LLM evaluation orchestrator (needs S1.5)
+
+**Phase D — Infospace Core (Stage 2)**
+8. S2.1 Infospace model and configuration
+9. S2.2 Lifecycle commands
+10. S2.3 Per-entity evaluation primitives (needs S1.6, S2.1)
+
+**Phase E — Collection Intelligence (Stage 2)**
+11. S2.4 Collection-level checks (needs S1.3, S1.4, S1.6, S1.7, S2.1)
+12. S2.5 Metrics history and viability tracking (needs S1.5, S2.4)
+
+**Phase F — Composition (Stage 2)**
+13. S2.6 Infospace composition model
+14. S2.7 Documentation
+
+**Phase G — Example (Stage 3)**
+15. S3.1 Migrate example to infospace config
+16. S3.2 Clean per-chapter history
+17. S3.3 Full evaluation run
+18. S3.4 Rewrite tutorial
+19. S3.5 Demonstrate composition
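+
+The recommended sequence can be checked mechanically against the
+declared dependencies. The sketch below encodes the "Depends on" and
+"needs" notes from this plan (with S2.2 depending on S2.1, as drawn in
+the Stage 2 graph) and verifies that no step precedes one of its
+prerequisites. The function name is an illustration, and
+`graphlib.TopologicalSorter` requires Python 3.9 or later.
+
+```python
+from graphlib import TopologicalSorter
+
+# step -> prerequisites, transcribed from this plan.
+DEPENDS_ON = {
+    "S1.2": {"S1.1"},
+    "S1.5": {"S1.1"},
+    "S1.7": {"S1.1"},
+    "S1.6": {"S1.5"},
+    "S2.2": {"S2.1"},
+    "S2.3": {"S1.6", "S2.1"},
+    "S2.4": {"S1.3", "S1.4", "S1.6", "S1.7", "S2.1"},
+    "S2.5": {"S2.4", "S1.5"},
+    "S2.6": {"S2.1"},
+    "S2.7": {"S2.1", "S2.2", "S2.3", "S2.4", "S2.5", "S2.6"},
+    "S3.1": {"S2.1"},
+    "S3.2": {"S3.1"},
+    "S3.3": {"S2.3", "S2.4", "S2.5", "S3.2"},
+    "S3.4": {"S3.1", "S3.2", "S3.3"},
+    "S3.5": {"S2.6", "S3.3"},
+}
+
+# The recommended sequence, Phase A through Phase G.
+ORDER = [
+    "S1.1", "S1.3", "S1.4", "S1.2", "S1.5", "S1.7", "S1.6",
+    "S2.1", "S2.2", "S2.3", "S2.4", "S2.5", "S2.6", "S2.7",
+    "S3.1", "S3.2", "S3.3", "S3.4", "S3.5",
+]
+
+
+def order_violations(order, depends_on):
+    """Return (step, prerequisite) pairs where the prerequisite comes later."""
+    position = {step: index for index, step in enumerate(order)}
+    return [
+        (step, dep)
+        for step, deps in depends_on.items()
+        for dep in sorted(deps)
+        if position[dep] > position[step]
+    ]
+
+
+# static_order() raises graphlib.CycleError if the dependencies are circular.
+topological = list(TopologicalSorter(DEPENDS_ON).static_order())
+violations = order_violations(ORDER, DEPENDS_ON)
+print("violations:", violations)
+```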