docs(metrics): clarify C2 coverage — domain×chapter matrix, not domain×VSM

- coverage.py: rewrite module docstring to explain what the metric
  actually computes (domain × chapter cross-tabulation, not VSM system
  coverage), what it does not capture (entity connectivity → C3), and
  when the threshold is appropriate
- CoverageReport: add domain_densities, density_std, cross_cutting_ratio
  for distribution-level insight beyond the aggregate ratio
- check_coverage: compute per-domain density and cross-cutting ratio
- METRICS-METHODOLOGY.md: correct C2 section to match implementation,
  document the distribution-based interpretation, add implementation
  status table distinguishing what is wired vs planned

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-20 00:08:46 +01:00
parent 0f54f094e4
commit dfe56a4f9b
2 changed files with 177 additions and 23 deletions
--- a/examples/infospace-with-history/METRICS-METHODOLOGY.md
+++ b/examples/infospace-with-history/METRICS-METHODOLOGY.md
@@ -290,31 +290,101 @@ pair list, scores, and merge/retire recommendations.

 ### C2: Coverage Completeness

-**Goal:** Identify domain areas and VSM systems that lack adequate
-representation in the entity set.
+**Goal:** Identify domain areas that are structurally sparse or isolated
+within the corpus — and separately, assess whether the entity set can answer
+the infospace's declared competency questions.
+
+**What the deterministic check actually computes**
+
+The current implementation builds a binary *domain × chapter* cross-table:
+one row per economic domain, one column per source chapter.  A cell is
+populated if at least one entity has that (domain, chapter) combination.
+
+    coverage_ratio = populated_cells / (n_domains × n_chapters)
+
+**This is not the same as VSM coverage.**  The domain × VSM matrix described
+in earlier versions of this document requires VSM system mappings to be
+supplied as `extra_attributes` to `check_coverage()`.  The pipeline does not
+currently do this, so `coverage_ratio` reflects *cross-chapter domain
+distribution*, not *VSM system coverage*.
+
+**Important: interpret the distribution, not just the ratio**
+
+The aggregate ratio conflates two structurally different situations:
+
+| Situation | coverage_ratio | What it means |
+|---|---|---|
+| Healthy topic separation | Low | Domains are locally dense within their book/section — expected for a multi-topic corpus |
+| Fragmented extraction | Low | Domains appear sporadically everywhere, never anchored |
+
+Both produce the same ratio.  Use the per-domain density distribution to
+distinguish them:
+
+| Metric | Meaning |
+|--------|---------|
+| `domain_densities` | Per-domain fraction of chapters containing ≥1 entity with that domain |
+| `density_std` | Standard deviation of densities.  High std → healthy topic separation (bimodal: some domains cross-cutting, others local).  Low std → uniform but thin. |
+| `cross_cutting_ratio` | Fraction of domains appearing in >50 % of chapters — the foundational, cross-cutting concepts. |
+
+Example interpretation for the WoN/VSM infospace (1021 entities, 35 chapters):
+
+```
+Exchange        0.848  ████████████████   cross-cutting
+Regulation      0.848  ████████████████   cross-cutting
+General Theory  0.727  ██████████████     cross-cutting
+Production      0.636  ████████████       cross-cutting
+Distribution    0.576  ███████████        borderline
+Accumulation    0.364  ███████            book-specific
+Consumption     0.333  ██████             book-specific
+
+density_std = 0.33   (high → healthy topic separation)
+cross_cutting_ratio = 0.50
+coverage_ratio = 0.44  (below 0.50 threshold, but for correct reasons)
+```
+
+**What coverage does NOT capture**
+
+- **Entity-to-entity connections** — whether concepts reference each other,
+  form explanatory chains, or cluster coherently.  That is C3 (Structural
+  Coherence).
+- **VSM competency question answerability** — whether current entities
+  collectively support answering the declared competency questions.  That
+  requires LLM-Eval and is a planned metric (see below).
+- **Whether absent (domain, chapter) cells are meaningful gaps or expected
+  absences** — the ratio treats them identically.
+
+**Threshold guidance**
+
+- `min: 0.50` is appropriate for a focused, single-topic corpus where all
+  chapters address the same set of domains.
+- For heterogeneous multi-book corpora, domains introduced late create empty
+  cells for all earlier chapters.  A threshold of `0.30–0.40` is more
+  realistic.
+- Prefer `cross_cutting_ratio` and `density_std` as the primary diagnostic
+  signals; use `coverage_ratio` only for trend tracking across snapshots.

 **Metrics:**

-| Metric | Type | Computation |
-|--------|------|-------------|
-| `domain_vsm_matrix` | Deterministic | Count entities per {economic_domain, VSM_system} cell |
-| `coverage_ratio` | Deterministic | `populated_cells / expected_cells` |
-| `vsm_balance_entropy` | Deterministic | Shannon entropy of entity distribution across VSM systems (higher = more balanced) |
-| `empty_cells` | Deterministic | List of {domain, VSM_system} pairs with zero entities |
-| `competency_coverage` | LLM-Eval | For each competency question, can it be answered with current entities? |
-| `fca_gap_concepts` | Deterministic | Attribute combinations in the FCA lattice with no corresponding entity |
+| Metric | Type | Computation | Status |
+|--------|------|-------------|--------|
+| `coverage_ratio` | Deterministic | `populated_cells / (n_domains × n_chapters)` | ✅ Implemented |
+| `domain_densities` | Deterministic | Per-domain fraction of chapters with ≥1 entity | ✅ Implemented |
+| `density_std` | Deterministic | Std dev of domain densities | ✅ Implemented |
+| `cross_cutting_ratio` | Deterministic | Fraction of domains with density > 0.5 | ✅ Implemented |
+| `empty_cells` | Deterministic | List of unpopulated (domain, chapter) pairs | ✅ Implemented |
+| `fca_gap_concepts` | Deterministic | Attribute combos in FCA lattice with no entity | ✅ Implemented |
+| `domain_vsm_matrix` | Deterministic | Entities per {domain, VSM_system} cell — requires VSM mappings in `extra_attributes` | ⬜ Not yet wired |
+| `competency_coverage` | LLM-Eval | For each competency question, can it be answered? | ⬜ Not yet implemented |

-**Pipeline:**
-1. Parse entity metadata (domain, VSM mapping) from files on disk
-2. Build domain × VSM matrix; identify empty cells
-3. Build FCA formal context; compute lattice; extract gap concepts
-4. Define competency questions (initially hand-written, later LLM-generated
-   from the source material)
-5. LLM-evaluate answerability of each question
-6. Aggregate into coverage ratio, entropy, and gap list
+**Pipeline (current):**
+1. Parse entity metadata (domain, source chapter) from entity files
+2. Build domain × chapter binary matrix; identify empty cells
+3. Compute per-domain densities, std dev, cross-cutting ratio
+4. Build FCA formal context; extract gap concepts
+5. Aggregate into `CoverageReport`

-**Output:** `output/metrics/coverage-report.md` + YAML with matrix, gaps,
-and competency question results.
+**Output:** Snapshot recorded in `output/metrics/history.yaml`.  A
+`coverage-report.md` per chapter is planned but not yet generated.

 ### C3: Structural Coherence

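The computation described in the C2 section above can be sketched in a few lines. This is a minimal illustration, not the code in `coverage.py`: `summarize_coverage` and the toy `pairs` list are hypothetical names, and the sketch assumes each entity has already been reduced to a `(domain, chapter)` pair. It uses `statistics.stdev` (the sample standard deviation), which is also what the `check_coverage` change in this commit calls.

```python
import statistics
from typing import Dict, Iterable, Tuple


def summarize_coverage(pairs: Iterable[Tuple[str, str]]) -> Dict[str, object]:
    """Compute C2-style statistics from (domain, chapter) pairs.

    Hypothetical helper for illustration only; not part of markitect.
    """
    populated = set(pairs)  # binary matrix: a cell is either populated or not
    domains = sorted({d for d, _ in populated})
    chapters = sorted({c for _, c in populated})
    if not domains or not chapters:
        return {
            "coverage_ratio": 0.0,
            "domain_densities": {},
            "density_std": 0.0,
            "cross_cutting_ratio": 0.0,
        }

    # Per-domain density: fraction of chapters in which the domain appears.
    densities = {
        d: sum(1 for c in chapters if (d, c) in populated) / len(chapters)
        for d in domains
    }
    values = list(densities.values())
    return {
        "coverage_ratio": len(populated) / (len(domains) * len(chapters)),
        "domain_densities": densities,
        # stdev needs at least two data points.
        "density_std": statistics.stdev(values) if len(values) >= 2 else 0.0,
        # Share of domains present in more than half of the chapters.
        "cross_cutting_ratio": sum(1 for v in values if v > 0.5) / len(values),
    }


# Toy corpus: "Exchange" appears in all four chapters, "Accumulation" in one.
pairs = [
    ("Exchange", "ch1"), ("Exchange", "ch2"),
    ("Exchange", "ch3"), ("Exchange", "ch4"),
    ("Accumulation", "ch4"),
]
report = summarize_coverage(pairs)
print(report)
```

On this toy input the ratio is 5/8 = 0.625, which equals the mean of the two per-domain densities (1.0 and 0.25). That identity is why the ratio on its own says nothing about how the densities are spread.
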
--- a/markitect/infospace/checks/coverage.py
+++ b/markitect/infospace/checks/coverage.py
@@ -1,12 +1,51 @@
 """
 C2 — Coverage completeness.

-Uses FCA and cross-tabulation to detect structural coverage gaps:
-attribute combinations (domain × VSM system) with no entities.
+**What this measures**
+
+Builds a binary *domain × chapter* cross-table: rows are economic domains
+found across all entities, columns are source chapters.  A cell is marked
+populated when at least one entity has that (domain, chapter) combination.
+
+    coverage_ratio = populated_cells / (n_domains × n_chapters)
+
+This measures how *uniformly* economic domains are distributed across source
+chapters.  It does not measure how richly entities connect to each other
+(that is C3, Structural Coherence), VSM system coverage (that requires
+supplying ``extra_attributes`` with VSM system mappings, which the pipeline
+does not currently do), or competency-question answerability.
+
+**Interpreting the ratio alone is misleading.**  A single ratio cannot
+distinguish two structurally different situations:
+
+- *Healthy topic separation* — domains are locally dense within their
+  book/section, sparse elsewhere.  The matrix has clean block structure;
+  low cross-chapter density per domain is *expected*.
+- *Fragmented extraction* — domains appear sporadically in all chapters,
+  never strongly anchored anywhere.  The matrix is uniformly thin everywhere.
+
+Both can produce the same ratio.  Use the *per-domain density distribution*
+(``domain_densities``, ``density_std``, ``cross_cutting_ratio``) to
+distinguish them:
+
+- High ``density_std`` + bimodal distribution → healthy topic separation.
+- Low ``density_std`` + uniform distribution → potential fragmentation.
+- ``cross_cutting_ratio`` measures what fraction of domains span more than
+  half the chapters — these are the foundational cross-cutting concepts.
+
+**Threshold note**
+
+A 0.50 threshold is appropriate for a focused single-topic corpus.  For a
+heterogeneous multi-book corpus (e.g. all five books of The Wealth of
+Nations), domains introduced in later books create empty cells for all
+earlier chapters, causing the ratio to fall below 0.50 even for structurally
+healthy corpora.  Consider 0.30–0.40 for large, multi-topic corpora.
 """

 from __future__ import annotations

+import math
+import statistics
 from dataclasses import dataclass, field
 from typing import Any, Dict, List, Optional

@@ -16,9 +55,30 @@ from markitect.analysis.fca import FormalContext, find_empty_cells, find_gap_con

@dataclass
 class CoverageReport:
-    """Results from coverage analysis."""
+    """Results from coverage analysis.
+
+    Attributes:
+        coverage_ratio: Fraction of (domain, chapter) cells that are
+            populated.  See module docstring for interpretation notes.
+        domain_densities: Per-domain fraction of chapters that contain
+            at least one entity with that domain.  Keys are domain names.
+        density_std: Standard deviation of ``domain_densities`` values.
+            High std suggests healthy topic separation; low std suggests
+            uniform but thin coverage.
+        cross_cutting_ratio: Fraction of domains that appear in more than
+            50 % of source chapters.  These are the foundational concepts.
+        empty_cells: List of ``{dimension_a, dimension_b}`` dicts for each
+            unpopulated (domain, chapter) cell.
+        gap_concepts: FCA gap concepts — attribute combinations present in
+            the lattice but with no entity.
+        domain_counts: Total entity count per domain.
+        entity_count: Total number of entities analysed.
+    """

    coverage_ratio: float = 0.0
+    domain_densities: Dict[str, float] = field(default_factory=dict)
+    density_std: float = 0.0
+    cross_cutting_ratio: float = 0.0
    empty_cells: List[dict] = field(default_factory=list)
    gap_concepts: List[dict] = field(default_factory=list)
    domain_counts: Dict[str, int] = field(default_factory=dict)
@@ -28,6 +88,9 @@ class CoverageReport:
        return {
            "concern": "C2",
            "coverage_ratio": round(self.coverage_ratio, 4),
+            "domain_densities": {k: round(v, 4) for k, v in self.domain_densities.items()},
+            "density_std": round(self.density_std, 4),
+            "cross_cutting_ratio": round(self.cross_cutting_ratio, 4),
            "empty_cells": self.empty_cells,
            "gap_concepts_count": len(self.gap_concepts),
            "domain_counts": self.domain_counts,
@@ -102,8 +165,29 @@ def check_coverage(
    populated = total_cells - len(empty)
    ratio = populated / total_cells if total_cells > 0 else 0.0

+    # Per-domain density: fraction of chapters that contain this domain
+    n_chapters = len(chapters)
+    domain_densities: Dict[str, float] = {}
+    if n_chapters > 0:
+        empty_pairs = {(e["dimension_a"], e["dimension_b"]) for e in empty}
+        for d in domains:
+            populated_for_domain = sum(
+                1 for c in chapters if (d, c) not in empty_pairs
+            )
+            domain_densities[d.removeprefix("domain:")] = populated_for_domain / n_chapters
+
+    density_values = list(domain_densities.values())
+    density_std = statistics.stdev(density_values) if len(density_values) >= 2 else 0.0
+    cross_cutting_ratio = (
+        sum(1 for v in density_values if v > 0.5) / len(density_values)
+        if density_values else 0.0
+    )
+
    return CoverageReport(
        coverage_ratio=ratio,
+        domain_densities=domain_densities,
+        density_std=density_std,
+        cross_cutting_ratio=cross_cutting_ratio,
        empty_cells=empty,
        gap_concepts=gap_dicts,
        domain_counts=domain_counts,
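
To make the "same ratio, different distribution" point concrete, here is a small sketch on invented data. Everything in it is an assumption for illustration: `density_profile`, `mixed` and `thin` are hypothetical names, and the two presence maps are made up rather than taken from the WoN/VSM infospace. Each map lists, per domain, the set of chapter indices in which that domain appears.

```python
import statistics
from typing import Dict, Set


def density_profile(presence: Dict[str, Set[int]], n_chapters: int) -> Dict[str, float]:
    """Summarise a domain -> chapters-present map (hypothetical helper)."""
    values = [len(chs) / n_chapters for chs in presence.values()]
    populated_cells = sum(len(chs) for chs in presence.values())
    return {
        "coverage_ratio": populated_cells / (len(presence) * n_chapters),
        "density_std": statistics.stdev(values) if len(values) >= 2 else 0.0,
        "cross_cutting_ratio": sum(1 for v in values if v > 0.5) / len(values),
    }


# One cross-cutting domain plus local ones: densities 1.0, 0.5, 0.25, 0.25.
mixed = {"A": {0, 1, 2, 3}, "B": {0, 1}, "C": {2}, "D": {3}}
# Every domain scattered over exactly half of the chapters: densities all 0.5.
thin = {"A": {0, 2}, "B": {1, 3}, "C": {0, 3}, "D": {1, 2}}

mixed_stats = density_profile(mixed, n_chapters=4)
thin_stats = density_profile(thin, n_chapters=4)
print(mixed_stats)
print(thin_stats)
```

Both maps populate 8 of 16 cells, so `coverage_ratio` is 0.5 in each case, while `density_std` is roughly 0.354 for `mixed` and exactly 0.0 for `thin`. Under the interpretation given in the docstring, only the distribution-level fields separate the two.
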