# Collection-Level Metrics Methodology

How we evaluate the quality of the infospace as a **collection of
interrelated concepts**, beyond the quality of individual entities.

This document describes the theoretical frameworks drawn from ontology
engineering, formal concept analysis, semiotic quality theory, and DSL
design, and how each is adapted to work within MarkiTect's two-layer
evaluation model (LLM-Eval + deterministic aggregation).

---

## 1. The Two-Layer Model

Every metric in this methodology decomposes into two layers:

| Layer | What it does | How it runs |
|-------|-------------|-------------|
| **LLM-Eval** | Qualitative judgment: "Are these two concepts the same?", "Is this definition grounded in the source?" | Prompt template → LLM → structured YAML output |
| **Deterministic** | Quantitative aggregation: cosine similarity, graph connectivity, coverage counting, cycle detection | Python code in `process_chapters.py` or dedicated `metrics.py` |

The LLM-Eval layer produces **per-entity** or **per-pair** structured
scores. The deterministic layer **aggregates** these into collection-level
metrics, persisted as machine-readable YAML alongside human-readable
markdown reports.

Per-concept quality metrics (definition precision, source grounding, VSM
relevance; see INFRA-TASKS 8-12) operate at the individual entity level.
This document covers the five **collection-level concerns** that assess how
the entities work together as an explanatory system.

---

## 2. Five Collection-Level Concerns

### Overview

| # | Concern | Question | Primary framework |
|---|---------|----------|-------------------|
| C1 | Semantic Overlap | Are there redundant concepts? | OOPS! P2, embedding similarity |
| C2 | Coverage Completeness | Does the concept set cover the domain? | SEQUAL, FCA |
| C3 | Structural Coherence | Do concepts form a connected explanatory graph? | OntoQA, graph theory |
| C4 | Definitional Consistency | Are concepts defined consistently and non-circularly? | OntoClean, OOPS! P24 |
| C5 | Granularity Balance | Are concepts at comparable levels of abstraction? | Granularity theory, DSL laconicity |

---

## 3. Theoretical Frameworks

### 3.1 SEQUAL (Semiotic Quality Framework)

**Origin:** Lindland, Sindre & Sølvberg (1994), extended by Krogstie et al.

**What it defines:** Quality of a conceptual model as the correspondence
between three worlds: the domain (what exists), the model (what we
captured), and the audience's interpretation (what they understand).

Two key dimensions of **semantic quality**:

- **Validity**: everything in the model corresponds to something real
  in the domain. No invented concepts.
- **Completeness**: everything relevant in the domain is represented in
  the model. No missing concepts.

**How we use it:** SEQUAL frames our entire metrics approach. Every
collection-level metric maps to one of these dimensions:

| SEQUAL dimension | Our concerns |
|-----------------|--------------|
| Validity | C1 (redundancy reduces validity: duplicate concepts don't correspond to distinct domain facts), C4 (consistency: contradictory definitions can't both be valid) |
| Completeness | C2 (coverage: are all needed concepts present?), C5 (granularity: missing levels of abstraction are completeness gaps) |
| Both | C3 (coherence: disconnected concepts suggest either missing bridging concepts [completeness] or misplaced concepts [validity]) |

**Adaptation:** SEQUAL was designed for formal models evaluated by human
experts. We replace human judgment with LLM-Eval (for validity checks like
"does this concept correspond to something Smith actually described?") and
deterministic counting (for completeness checks like "which VSM systems
lack entity mappings?").

### 3.2 OntoClean

**Origin:** Guarino & Welty (2004).

**What it defines:** A methodology for validating taxonomic relationships
by assigning **metaproperties** to each concept:

- **Rigidity**: Is the property essential to all its instances? (e.g.
  "market" is rigid; "effectual demander" is anti-rigid, because an agent
  can stop being an effectual demander)
- **Identity**: Does the concept carry an identity criterion? (e.g.
  "division of labour" can be identified by its three causal mechanisms)
- **Unity**: Are all instances of this concept whole in the same way?
- **Dependence**: Does the concept require another concept to exist?
  (e.g. "market price" depends on "effectual demand")

**Constraint:** A rigid concept cannot be subsumed by an anti-rigid one.
Violations indicate structural confusion.

**How we use it:** We do not have a formal taxonomy, but our flat entity
set implicitly contains subsumption relationships (e.g. "natural rate"
subsumes "ordinary-or-average rate"). OntoClean metaproperties help detect:

- **Granularity mismatches** (C5): A rigid concept at the same level as
  an anti-rigid one suggests different abstraction levels are mixed.
- **Definitional consistency** (C4): If entity A depends on entity B per
  OntoClean, but B's definition doesn't acknowledge A, the definitions
  are inconsistent.
- **Redundancy** (C1): Two entities with identical metaproperty profiles
  and overlapping definitions are candidates for merging.

**Adaptation:** Instead of manual metaproperty assignment, we use LLM-Eval
to classify each entity's rigidity, identity criterion, and dependencies.
The constraint checking is then deterministic.
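
The deterministic half of this check can be sketched roughly as follows. The data layout (a `rigidity` mapping and a list of `(parent, child)` subsumption pairs) and the entity slugs are assumptions made for illustration, not MarkiTect's actual schema; only the rule itself (an anti-rigid concept must not subsume a rigid one) comes from the text above.

```python
from typing import Dict, List, Tuple

# Hypothetical LLM-Eval output: entity slug -> rigidity label.
rigidity: Dict[str, str] = {
    "market": "rigid",
    "effectual-demander": "anti-rigid",
    "natural-rate": "rigid",
    "ordinary-or-average-rate": "rigid",
}

# Hypothetical subsumption pairs (parent, child): parent subsumes child.
subsumptions: List[Tuple[str, str]] = [
    ("natural-rate", "ordinary-or-average-rate"),
    ("effectual-demander", "market"),  # deliberately wrong, to trigger the check
]


def rigidity_violations(
    rigidity: Dict[str, str], subsumptions: List[Tuple[str, str]]
) -> List[Tuple[str, str]]:
    """Return pairs where an anti-rigid parent subsumes a rigid child."""
    return [
        (parent, child)
        for parent, child in subsumptions
        if rigidity.get(parent) == "anti-rigid" and rigidity.get(child) == "rigid"
    ]


violations = rigidity_violations(rigidity, subsumptions)
print(violations)
```

Entities missing from the `rigidity` mapping are silently skipped here; a real check would probably want to report them as unclassified instead.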

### 3.3 OOPS! (Ontology Pitfall Scanner)

**Origin:** Poveda-Villalón et al. (2014). Catalogue of 41 common
ontology design pitfalls.

**What it defines:** Concrete, testable anti-patterns. The pitfalls most
relevant to our infospace:

| Pitfall | Description | Our concern |
|---------|-------------|-------------|
| P2 | Synonymous classes: different names, same meaning | C1 (redundancy) |
| P4 | Unconnected ontology elements | C3 (coherence) |
| P7 | Merging different concepts in the same class | C5 (granularity: too coarse) |
| P10 | Missing disjointness axioms | C1 (how do we know two concepts don't overlap?) |
| P11 | Missing domain or range | C4 (consistency) |
| P13 | Missing inverse relationships | C3 |
| P24 | Recursive/circular definition | C4 (consistency) |
| P25 | Inverse of itself | C4 |

**How we use it:** OOPS! pitfalls become a **checklist for LLM-Eval
prompts**. Rather than running a formal OWL scanner, we ask the LLM to
check for each pitfall pattern:

- "Are entities A and B synonymous?" (P2)
- "Does entity A's definition reference itself?" (P24)
- "Is entity A actually two distinct concepts merged together?" (P7)

The deterministic layer counts pitfall occurrences and tracks them over
time.

**Adaptation:** We select the subset of OOPS! pitfalls applicable to
semi-formal markdown-based ontologies (no OWL axioms) and implement each
as an LLM-Eval prompt pattern rather than a formal reasoner check.
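
As a minimal sketch of the counting step, assuming the LLM-Eval layer emits one record per flagged pitfall (the record shape and the entity slugs below are hypothetical):

```python
from collections import Counter
from typing import Dict, List

# Hypothetical LLM-Eval findings, one record per flagged pitfall.
findings: List[dict] = [
    {"pitfall": "P2", "entities": ["natural-price", "central-price"]},
    {"pitfall": "P24", "entities": ["value"]},
    {"pitfall": "P2", "entities": ["stock", "capital"]},
]


def count_pitfalls(findings: List[dict]) -> Dict[str, int]:
    """Count how many findings were raised per pitfall code."""
    return dict(Counter(record["pitfall"] for record in findings))


pitfall_counts = count_pitfalls(findings)
print(pitfall_counts)
```

Storing one such dictionary per evaluation run is enough to plot how each pitfall count moves over time.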

### 3.4 OntoQA (Metric-Based Ontology Quality Analysis)

**Origin:** Tartir & Arpinar (2007).

**What it defines:** Quantitative schema-level and instance-level metrics:

- **Relationship Richness (RR):** Proportion of non-taxonomic (lateral)
  relationships to total relationships. `RR = non_hierarchical / total`.
  Low RR = mere taxonomy. High RR = rich cross-cutting connections.
- **Attribute Richness (AR):** Average number of attributes per concept.
  `AR = total_attributes / total_concepts`.
- **Inheritance Richness (IR):** Average subclasses per class; measures
  how knowledge distributes across the hierarchy.
- **Class Richness (CR):** Proportion of classes with instances.

**How we use it:** Our entities don't have formal relationships declared
between them, but we can **infer** a relationship graph from their
definitions and mappings:

- Entity A references entity B in its definition → definitional dependency
- Entities A and B map to the same VSM system → structural co-occurrence
- Entities A and B appear in the same chapter → contextual co-occurrence

From this inferred graph, we compute OntoQA metrics directly:

- **Relationship Richness** tells us whether our concepts form a web of
  explanatory connections or just a flat list.
- **Attribute Richness** maps to our schema sections: entities with more
  optional sections filled (Original Wording, Modern Interpretation) are
  richer.

**Adaptation:** The key modification is that relationship inference is an
LLM-Eval step (pairwise: "does A's definition depend on or reference B?"),
after which all OntoQA metrics are computed deterministically on the
resulting graph.
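
The two formulas above might be computed along these lines once the graph has been inferred. The edge-kind labels, the `sections` mapping and the sample entities are illustrative assumptions; only the `RR` and `AR` definitions are taken from the text.

```python
from typing import Dict, List, Tuple

# Hypothetical inferred edges: (source, target, kind).
edges: List[Tuple[str, str, str]] = [
    ("market-price", "effectual-demand", "definitional"),
    ("natural-rate", "ordinary-or-average-rate", "hierarchical"),
    ("market-price", "natural-price", "co-occurrence"),
    ("division-of-labour", "market-extent", "definitional"),
]

# Hypothetical optional schema sections filled in per entity.
sections: Dict[str, List[str]] = {
    "market-price": ["Original Wording", "Modern Interpretation"],
    "natural-rate": ["Original Wording"],
    "division-of-labour": [],
}


def relationship_richness(edges: List[Tuple[str, str, str]]) -> float:
    """RR = non_hierarchical / total (0.0 for an empty graph)."""
    if not edges:
        return 0.0
    non_hierarchical = sum(1 for _, _, kind in edges if kind != "hierarchical")
    return non_hierarchical / len(edges)


def attribute_richness(sections: Dict[str, List[str]]) -> float:
    """AR = total_attributes / total_concepts (0.0 for no concepts)."""
    if not sections:
        return 0.0
    return sum(len(filled) for filled in sections.values()) / len(sections)


rr = relationship_richness(edges)
ar = attribute_richness(sections)
print(rr, ar)
```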

### 3.5 Formal Concept Analysis (FCA)

**Origin:** Wille (1982). Applied to ontology auditing by Elhaj et al.
(2008) for SNOMED CT completeness checking.

**What it defines:** A mathematical framework for deriving a **concept
lattice** from a binary relation between objects and attributes. The
lattice reveals:

- **Formal concepts**: maximal sets of objects sharing the same attributes
- **Subconcept/superconcept** relationships: the natural hierarchy
- **Missing concepts**: attribute combinations with no corresponding object

**How we use it:** We construct a **formal context** (binary matrix):

- **Objects** = our 85 entities
- **Attributes** = economic domain, VSM system, source book, abstraction
  level (from LLM-Eval), key terms (extracted from definitions)

The concept lattice then reveals:

- **Coverage gaps** (C2): Attribute combinations with no entity. E.g. if
  the cell {Distribution, S3} is empty, we lack control-layer concepts
  for distribution, which is a specific, actionable gap.
- **Redundancy** (C1): Entities with identical attribute sets (same formal
  concept) are candidates for merging.
- **Granularity** (C5): The lattice depth indicates how many meaningful
  levels of abstraction exist. A shallow lattice suggests missing
  intermediate concepts.

**Adaptation:** Classic FCA requires crisp binary attributes. Our domains
and VSM mappings are already categorical, but abstraction level and key
terms need LLM-Eval to produce. The lattice computation itself is
deterministic (Python `concepts` library or equivalent). The FCA approach
replaces the current "ask the LLM about coverage" with a structural
computation that can identify *specific* gaps rather than vague
recommendations.
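
To make the formal-context idea concrete, here is a small standard-library sketch. It enumerates formal concepts by brute force, which is exponential and only workable for toy inputs; it stands in for a proper lattice library and is not the project's implementation. The context contents and the helper names (`extent`, `intent`, `formal_concepts`, `empty_cells`) are assumptions for illustration.

```python
from itertools import combinations, product

# Hypothetical formal context: entity slug -> set of categorical attributes.
context = {
    "division-of-labour": {"Production", "S1"},
    "market-price": {"Exchange", "S2"},
    "natural-price": {"Exchange", "S2"},
    "wages": {"Distribution", "S1"},
}


def extent(attrs, context):
    """Objects that carry every attribute in attrs."""
    return frozenset(obj for obj, has in context.items() if set(attrs) <= has)


def intent(objs, context):
    """Attributes shared by every object in objs (all attributes if objs is empty)."""
    shared = set().union(*context.values())
    for obj in objs:
        shared &= context[obj]
    return frozenset(shared)


def formal_concepts(context):
    """Brute-force enumeration of (extent, intent) pairs; tiny contexts only."""
    found = set()
    objects = sorted(context)
    for size in range(len(objects) + 1):
        for subset in combinations(objects, size):
            attrs = intent(subset, context)
            found.add((extent(attrs, context), attrs))
    return found


def empty_cells(context, domains, systems):
    """(domain, VSM system) combinations that no entity carries."""
    return [(d, s) for d, s in product(domains, systems) if not extent({d, s}, context)]


concepts = formal_concepts(context)
gaps = empty_cells(context, ["Production", "Exchange", "Distribution"], ["S1", "S2", "S3"])
print(len(concepts), gaps)
```

In this toy context, `market-price` and `natural-price` share one formal concept (a C1 merge candidate), and the (`Distribution`, `S3`) cell shows up as a gap, mirroring the example in the text.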

### 3.6 DSL Design Principles

**Origin:** Mernik et al. (2005) "When and How to Develop DSLs";
Karsai et al. (2014) "Design Guidelines for Domain-Specific Languages".

**What they define:** Quality criteria for a set of concepts that form a
language for a specific domain:

- **Soundness**: Every concept in the language corresponds to a real domain
  concern (no invented abstractions).
- **Completeness**: The language can express everything needed for its
  intended tasks.
- **Laconicity**: No unnecessary concepts; every concept earns its place.
- **Orthogonality**: Concepts are independent; combining any two produces
  a meaningful result (no redundant combinations).

**How we use it:** Our entity set is effectively a domain-specific
vocabulary for "explaining classical economics through VSM". DSL quality
criteria translate directly:

- **Soundness** → Validity (SEQUAL): every entity grounded in Smith's text
- **Completeness** → Coverage (C2): can we answer the "competency
  questions" the infospace is meant to address?
- **Laconicity** → Anti-redundancy (C1) + Indispensability (C5): would
  removing any entity lose explanatory power?
- **Orthogonality** → Non-overlap (C1): entity definitions don't
  substantially duplicate each other

**Adaptation:** We operationalise DSL completeness through **competency
questions**: a set of canonical questions the infospace should be able to
answer (e.g. "How does the division of labour relate to market extent?",
"What mechanisms regulate wages toward their natural rate?"). LLM-Eval
tests whether the current entity set suffices to answer each question.
Unanswerable questions identify specific completeness gaps.

Laconicity is operationalised as **indispensability scoring**: for each
entity, LLM-Eval rates whether removing it would lose explanatory power.
Low-scoring entities are candidates for merging or retirement.

---

## 4. Integration: Metric Definitions by Concern

### C1: Semantic Overlap / Redundancy

**Goal:** Identify entities that substantially overlap in meaning and
should be merged, distinguished, or retired.

**Metrics:**

| Metric | Type | Computation |
|--------|------|-------------|
| `similarity_matrix` | Deterministic | Embed all entity definitions; compute N×N cosine similarity |
| `high_similarity_pairs` | Deterministic | Pairs with cosine > 0.80, sorted descending |
| `confirmed_synonyms` | LLM-Eval | For each high-similarity pair, LLM judges: "same concept" / "genuinely distinct" / "partial overlap" |
| `redundancy_ratio` | Deterministic | `confirmed_synonyms / total_entities` |
| `intensional_conciseness` | Deterministic | `1 - redundancy_ratio` (from KG quality framework) |

**Pipeline:**

1. Embed definitions (embedding API or local model)
2. Compute cosine similarity matrix
3. Filter pairs above threshold
4. LLM pairwise judgment on filtered pairs only (avoids N² LLM calls)
5. Aggregate into ratio and conciseness score

**Output:** `output/metrics/redundancy-report.md` + structured YAML with
pair list, scores, and merge/retire recommendations.
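
A rough sketch of steps 2 to 5, using toy three-dimensional vectors in place of real embeddings. The entity slugs, vectors and the `verdicts` mapping (standing in for the LLM judgment of step 4) are invented for illustration; the 0.80 threshold and the two ratio formulas come from the metrics table.

```python
import math
from itertools import combinations

# Toy vectors standing in for real definition embeddings (hypothetical values).
embeddings = {
    "natural-price": [0.90, 0.10, 0.00],
    "central-price": [0.88, 0.12, 0.02],
    "division-of-labour": [0.00, 0.20, 0.95],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def high_similarity_pairs(embeddings, threshold=0.80):
    """All unordered pairs with cosine above the threshold, most similar first."""
    scored = [
        (a, b, cosine(embeddings[a], embeddings[b]))
        for a, b in combinations(sorted(embeddings), 2)
    ]
    return sorted((p for p in scored if p[2] > threshold), key=lambda p: p[2], reverse=True)


candidates = high_similarity_pairs(embeddings)

# Hypothetical LLM-Eval verdicts for the filtered pairs only.
verdicts = {("central-price", "natural-price"): "same concept"}
confirmed = [p for p in candidates if verdicts.get((p[0], p[1])) == "same concept"]

redundancy_ratio = len(confirmed) / len(embeddings)
intensional_conciseness = 1 - redundancy_ratio
print(candidates, redundancy_ratio, intensional_conciseness)
```

Filtering before the LLM step is what keeps the number of pairwise judgments small: only one of the three possible pairs reaches the LLM in this example.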

### C2: Coverage Completeness

**Goal:** Identify domain areas that are structurally sparse or isolated
within the corpus and, separately, assess whether the entity set can answer
the infospace's declared competency questions.

**What the deterministic check actually computes**

The current implementation builds a binary *domain × chapter* cross-table:
one row per economic domain, one column per source chapter. A cell is
populated if at least one entity has that (domain, chapter) combination.

    coverage_ratio = populated_cells / (n_domains × n_chapters)

**This is not the same as VSM coverage.** The domain × VSM matrix described
in earlier versions of this document requires VSM system mappings to be
supplied as `extra_attributes` to `check_coverage()`. The pipeline does not
currently do this, so `coverage_ratio` reflects *cross-chapter domain
distribution*, not *VSM system coverage*.
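
The cross-table computation can be sketched as below. This is not the body of `check_coverage()`; the function name `coverage_cross_table`, the entity record shape, and the choice to derive both axes from the observed values are assumptions made for this example.

```python
from itertools import product

# Hypothetical entity metadata: one (domain, chapter) combination per record.
entities = [
    {"slug": "division-of-labour", "domain": "Production", "chapter": "c1"},
    {"slug": "market-extent", "domain": "Production", "chapter": "c2"},
    {"slug": "money", "domain": "Exchange", "chapter": "c2"},
    {"slug": "barter", "domain": "Exchange", "chapter": "c2"},
    {"slug": "market-price", "domain": "Exchange", "chapter": "c3"},
]


def coverage_cross_table(entities):
    """Return (coverage_ratio, empty_cells) for the binary domain x chapter table."""
    domains = {e["domain"] for e in entities}
    chapters = {e["chapter"] for e in entities}
    populated = {(e["domain"], e["chapter"]) for e in entities}
    total = len(domains) * len(chapters)
    empty = sorted(set(product(domains, chapters)) - populated)
    ratio = len(populated) / total if total else 0.0
    return ratio, empty


ratio, empty = coverage_cross_table(entities)
print(round(ratio, 3), empty)
```

Two entities in the same cell (`money` and `barter`) count once, because the table is binary: the ratio measures spread, not volume.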

**Important: interpret the distribution, not just the ratio**

The aggregate ratio conflates two structurally different situations:

| Situation | coverage_ratio | What it means |
|---|---|---|
| Healthy topic separation | Low | Domains are locally dense within their book/section; expected for a multi-topic corpus |
| Fragmented extraction | Low | Domains appear sporadically everywhere, never anchored |

Both produce the same ratio. Use the per-domain density distribution to
distinguish them:

| Metric | Meaning |
|--------|---------|
| `domain_densities` | Per-domain fraction of chapters containing ≥1 entity with that domain |
| `density_std` | Standard deviation of densities. High std → healthy topic separation (bimodal: some domains cross-cutting, others local). Low std → uniform but thin. |
| `cross_cutting_ratio` | Fraction of domains appearing in >50% of chapters (the foundational, cross-cutting concepts). |

Example interpretation for the WoN/VSM infospace (1021 entities, 35 chapters):

```
Exchange        0.848  ████████████████  cross-cutting
Regulation      0.848  ████████████████  cross-cutting
General Theory  0.727  ██████████████    cross-cutting
Production      0.636  ████████████      cross-cutting
Distribution    0.576  ███████████       borderline
Accumulation    0.364  ███████           book-specific
Consumption     0.333  ██████            book-specific

density_std         = 0.33  (high → healthy topic separation)
cross_cutting_ratio = 0.50
coverage_ratio      = 0.44  (below 0.50 threshold, but for correct reasons)
```
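
A possible way to derive the three distribution metrics from a domain-to-chapters mapping is sketched here. Whether the implementation uses population or sample standard deviation is not stated in this document; the sketch assumes population (`statistics.pstdev`). The function name `density_stats` and the sample data are hypothetical.

```python
from statistics import pstdev

# Hypothetical input: domain -> set of chapters containing >=1 entity of that domain.
domain_chapters = {
    "Exchange": {1, 2, 3, 4},
    "Production": {1, 2, 3},
    "Consumption": {4},
}


def density_stats(domain_chapters, n_chapters, cross_cutting_threshold=0.5):
    """Return (domain_densities, density_std, cross_cutting_ratio)."""
    densities = {
        domain: len(chapters) / n_chapters
        for domain, chapters in domain_chapters.items()
    }
    values = list(densities.values())
    # Assumption: population standard deviation over the per-domain densities.
    std = pstdev(values) if len(values) > 1 else 0.0
    cross_cutting = (
        sum(1 for v in values if v > cross_cutting_threshold) / len(values)
        if values
        else 0.0
    )
    return densities, std, cross_cutting


densities, density_std, cross_cutting_ratio = density_stats(domain_chapters, n_chapters=4)
print(densities, round(density_std, 3), round(cross_cutting_ratio, 3))
```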

**What coverage does NOT capture**

- **Entity-to-entity connections**: whether concepts reference each other,
  form explanatory chains, or cluster coherently. That is C3 (Structural
  Coherence).
- **VSM competency question answerability**: whether current entities
  collectively support answering the declared competency questions. That
  requires LLM-Eval and is a planned metric (see below).
- **Whether absent (domain, chapter) cells are meaningful gaps or expected
  absences**: the ratio treats them identically.

**Threshold guidance**

- `min: 0.50` is appropriate for a focused, single-topic corpus where all
  chapters address the same set of domains.
- For heterogeneous multi-book corpora, domains introduced late create empty
  cells for all earlier chapters. A threshold of `0.30–0.40` is more
  realistic.
- Prefer `cross_cutting_ratio` and `density_std` as the primary diagnostic
  signals; use `coverage_ratio` only for trend tracking across snapshots.

**Metrics:**

| Metric | Type | Computation | Status |
|--------|------|-------------|--------|
| `coverage_ratio` | Deterministic | `populated_cells / (n_domains × n_chapters)` | ✅ Implemented |
| `domain_densities` | Deterministic | Per-domain fraction of chapters with ≥1 entity | ✅ Implemented |
| `density_std` | Deterministic | Std dev of domain densities | ✅ Implemented |
| `cross_cutting_ratio` | Deterministic | Fraction of domains with density > 0.5 | ✅ Implemented |
| `empty_cells` | Deterministic | List of unpopulated (domain, chapter) pairs | ✅ Implemented |
| `fca_gap_concepts` | Deterministic | Attribute combos in FCA lattice with no entity | ✅ Implemented |
| `domain_vsm_matrix` | Deterministic | Entities per {domain, VSM_system} cell; requires VSM mappings in `extra_attributes` | ⬜ Not yet wired |
| `competency_coverage` | LLM-Eval | For each competency question, can it be answered? | ⬜ Not yet implemented |

**Pipeline (current):**

1. Parse entity metadata (domain, source chapter) from entity files
2. Build domain × chapter binary matrix; identify empty cells
3. Compute per-domain densities, std dev, cross-cutting ratio
4. Build FCA formal context; extract gap concepts
5. Aggregate into `CoverageReport`

**Output:** Snapshot recorded in `output/metrics/history.yaml`. A
`coverage-report.md` per chapter is planned but not yet generated.

### C3: Structural Coherence

**Goal:** Determine whether the entities form a connected explanatory web
or a fragmented collection of isolated concepts.

**Metrics:**

| Metric | Type | Computation |
|--------|------|-------------|
| `relationship_graph` | LLM-Eval + Deterministic | Infer edges from definition cross-references (string matching) + LLM judgment for implicit references |
| `connected_components` | Deterministic | Number of connected components in the graph (target: 1) |
| `graph_density` | Deterministic | `actual_edges / possible_edges` |
| `avg_degree` | Deterministic | `total_edges / total_entities` (the mean out-degree of the directed graph) |
| `relationship_richness` | Deterministic | OntoQA RR: `non_hierarchical_edges / total_edges` |
| `modularity` | Deterministic | Louvain modularity score (0.3-0.7 = meaningful structure; >0.8 = fragmentation) |
| `bridge_concepts` | Deterministic | Entities with highest betweenness centrality (connect clusters) |
| `orphan_entities` | Deterministic | Entities with degree 0 or 1 |
| `cohesion_by_domain` | Deterministic | Avg intra-domain edges per entity |
| `coupling_across_domains` | Deterministic | Inter-domain edges / total edges |

**Pipeline:**

1. Extract explicit cross-references from definitions (entity name
   mentions in other definitions, found by string matching with slug
   normalisation)
2. For entity pairs not caught by string matching, LLM-Eval: "Does A's
   definition depend on or reference B's concept?"
3. Build directed graph
4. Compute graph metrics (networkx or equivalent)
5. Run community detection; compare detected communities to declared
   economic domains

**Output:** `output/metrics/coherence-report.md` + YAML with graph
statistics, orphan list, bridge concepts, and community structure.
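
A few of the simpler graph metrics can be computed with the standard library alone, as in this sketch. It is a stand-in for the networkx-based step, not the project's code. It assumes edges are unique directed pairs, counts components and orphan degree on the undirected view of the graph, and takes `possible_edges` to be `n × (n − 1)`; the entity slugs are illustrative.

```python
from collections import defaultdict, deque

nodes = [
    "market-price", "effectual-demand", "natural-price",
    "division-of-labour", "market-extent", "rent",
]
# Hypothetical directed edges: (A, B) means A's definition references B.
edges = [
    ("market-price", "effectual-demand"),
    ("market-price", "natural-price"),
    ("natural-price", "effectual-demand"),
    ("division-of-labour", "market-extent"),
]


def graph_metrics(nodes, edges):
    """Weak components, directed density, mean out-degree and orphan list."""
    neighbours = defaultdict(set)
    for source, target in edges:
        neighbours[source].add(target)
        neighbours[target].add(source)

    seen, components = set(), 0
    for start in nodes:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        queue = deque([start])
        while queue:  # breadth-first walk over one component
            current = queue.popleft()
            for nxt in neighbours[current]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)

    n = len(nodes)
    possible = n * (n - 1)  # ordered pairs, no self-loops
    return {
        "connected_components": components,
        "graph_density": len(edges) / possible if possible else 0.0,
        "avg_degree": len(edges) / n if n else 0.0,
        "orphan_entities": sorted(x for x in nodes if len(neighbours[x]) <= 1),
    }


metrics = graph_metrics(nodes, edges)
print(metrics)
```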

### C4: Definitional Consistency

**Goal:** Ensure entities are defined consistently, non-circularly, and
without contradicting each other.

**Metrics:**

| Metric | Type | Computation |
|--------|------|-------------|
| `definitional_dependency_graph` | Deterministic + LLM-Eval | Edges where A's definition uses B's concept |
| `circular_definitions` | Deterministic | Cycles of length ≤ 3 in the dependency graph |
| `definition_depth` | Deterministic | Longest dependency chain per entity before reaching a term not in the entity set |
| `undefined_dependencies` | Deterministic | Terms used in definitions that arguably should be entities but aren't |
| `pairwise_consistency` | LLM-Eval | For related entity pairs (sharing edges): "Do these definitions contradict each other?" |
| `source_fidelity` | LLM-Eval | "Does this definition accurately represent what Smith wrote in the cited passage?" |
| `metaproperty_violations` | LLM-Eval + Deterministic | OntoClean constraint checking after LLM classifies rigidity/identity |
| `grounding_ratio` | Deterministic | Fraction of entities traceable to primitives without cycles |

**Pipeline:**

1. Build definitional dependency graph (same technique as C3, with the
   direction fixed: "A depends on B" means A's definition uses B, not
   vice versa)
2. Detect cycles; flag short cycles
3. Extract undefined terms (terms matching entity-name patterns that appear
   in definitions but have no corresponding entity file)
4. LLM pairwise consistency check on directly-connected pairs
5. LLM source fidelity check (compare definition to source chapter text)
6. LLM OntoClean metaproperty classification; deterministic constraint
   checking

**Output:** `output/metrics/consistency-report.md` + YAML with cycle list,
undefined terms, contradiction candidates, and metaproperty violations.
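
Short-cycle detection on the dependency graph might look like this bounded depth-first search. The adjacency-dict layout, the function name `short_cycles` and the sample entities are assumptions; the length limit of 3 matches the `circular_definitions` metric.

```python
from typing import Dict, List, Tuple

# Hypothetical dependency graph: A -> [B, ...] means A's definition uses B.
dependencies: Dict[str, List[str]] = {
    "value": ["labour"],
    "labour": ["wages"],
    "wages": ["value"],
    "market-price": ["effectual-demand"],
    "effectual-demand": ["market-price"],
    "rent": ["land"],
}


def short_cycles(graph: Dict[str, List[str]], max_len: int = 3) -> List[Tuple[str, ...]]:
    """Find simple directed cycles with at most max_len nodes, each reported once."""
    cycles = set()

    def walk(start: str, node: str, path: List[str]) -> None:
        for nxt in graph.get(node, ()):
            if nxt == start:
                # Rotate so the smallest slug comes first; this deduplicates
                # the same cycle discovered from different starting nodes.
                pivot = path.index(min(path))
                cycles.add(tuple(path[pivot:] + path[:pivot]))
            elif nxt not in path and len(path) < max_len:
                walk(start, nxt, path + [nxt])

    for start in graph:
        walk(start, start, [start])
    return sorted(cycles)


cycles = short_cycles(dependencies)
print(cycles)
```

Capping the search depth keeps the check cheap and matches the intent of flagging only tight definitional loops; longer cycles would need a general algorithm such as the one behind `networkx.simple_cycles`.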

### C5: Granularity Balance

**Goal:** Ensure entities operate at comparable levels of abstraction
within their respective domains and perspectives.

**Metrics:**

| Metric | Type | Computation |
|--------|------|-------------|
| `abstraction_classification` | LLM-Eval | Classify each entity as theory-level / mechanism-level / observation-level |
| `scope_score` | LLM-Eval | Rate each entity 1-5 for generality (1 = very specific instance, 5 = broad theoretical principle) |
| `abstraction_distribution` | Deterministic | Count per level; compute entropy |
| `scope_variance` | Deterministic | Variance of scope scores within each domain |
| `level_x_perspective_matrix` | Deterministic | Cross-tabulation of abstraction level × economic domain |
| `indispensability` | LLM-Eval | "If removed, what explanatory power is lost?" (1-5) |
| `dispensable_entities` | Deterministic | Entities with indispensability score ≤ 2 |
| `merge_candidates` | LLM-Eval | Pairs where one is a sub-case of the other |

**Pipeline:**

1. LLM-classify each entity: abstraction level, scope score,
   indispensability
2. Build level × perspective matrix
3. Compute distribution entropy and per-domain scope variance
4. Flag outliers: entities whose scope score deviates > 1.5σ from their
   domain mean
5. For outlier entities, LLM-Eval: "Should this be merged into a broader
   concept, or split into sub-concepts?"

**Output:** `output/metrics/granularity-report.md` + YAML with
classifications, distribution, outliers, and merge/split recommendations.
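
Steps 3 and 4 can be illustrated with a short sketch. The choice of base-2 entropy and population standard deviation is an assumption (the document does not specify either), and the entity records are invented for the example.

```python
import math
from collections import Counter, defaultdict
from statistics import mean, pstdev


def level_entropy(levels):
    """Shannon entropy (bits) of the abstraction-level distribution."""
    counts = Counter(levels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def scope_outliers(entities, k=1.5):
    """Slugs whose scope score deviates more than k sigma from their domain mean."""
    by_domain = defaultdict(list)
    for entity in entities:
        by_domain[entity["domain"]].append(entity)

    outliers = []
    for group in by_domain.values():
        scores = [e["scope"] for e in group]
        mu, sigma = mean(scores), pstdev(scores)
        if sigma == 0:
            continue  # all scores identical: nothing can be an outlier
        outliers += [e["slug"] for e in group if abs(e["scope"] - mu) > k * sigma]
    return outliers


# Hypothetical LLM-Eval classifications.
levels = ["theory", "mechanism", "mechanism", "observation"]
entities = [
    {"slug": f"exchange-entity-{i}", "domain": "Exchange", "scope": 2} for i in range(5)
] + [{"slug": "theory-of-value", "domain": "Exchange", "scope": 5}]

entropy = level_entropy(levels)
outliers = scope_outliers(entities)
print(entropy, outliers)
```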

---

## 5. Shared Infrastructure

Several concerns share underlying computations:

| Infrastructure | Used by | Build once |
|---------------|---------|------------|
| Definition embeddings (vector per entity) | C1, C3 | Embedding API call per entity |
| Relationship graph (entity → entity edges) | C3, C4 | String matching + LLM-Eval |
| FCA formal context (entity × attribute matrix) | C2, C5 | Metadata parsing + LLM classification |
| Entity metadata index (domain, VSM, chapter, sections) | C2, C5, C10 (schema compliance) | Deterministic markdown parsing |

These should be computed once per evaluation run and cached for use by
all concern-specific metrics.

---

## 6. Evaluation Workflow

A full collection-level evaluation run:

```
process_chapters.py --evaluate-collection --provider <provider>
```

1. **Parse**: deterministic metadata extraction from all entity files
2. **Embed**: compute definition embeddings (cached; only new/changed
   entities need fresh embeddings)
3. **Infer**: LLM-Eval for relationship edges, metaproperties,
   abstraction levels, pairwise judgments (batched to minimise LLM calls)
4. **Compute**: deterministic graph metrics, FCA lattice, coverage
   matrix, similarity matrix, cycle detection
5. **Aggregate**: combine per-entity and per-pair scores into
   collection-level metrics
6. **Report**: write per-concern markdown reports + unified `metrics.yaml`
7. **Append**: add timestamped snapshot to `metrics-history.yaml`

Incremental mode (`--evaluate-collection --chapter <id>`) re-evaluates
only the entities introduced or modified by that chapter, plus any
pairwise checks involving those entities.

---

## 7. References

- Lindland, O.I., Sindre, G. & Sølvberg, A. (1994). "Understanding
  Quality in Conceptual Modeling." *IEEE Software* 11(2), 42-49.
  → SEQUAL framework: validity and completeness dimensions.

- Guarino, N. & Welty, C.A. (2004). "An Overview of OntoClean." In
  *Handbook on Ontologies*, Springer, 151-171.
  → Metaproperty analysis: rigidity, identity, unity, dependence.

- Poveda-Villalón, M., Gómez-Pérez, A. & Suárez-Figueroa, M.C. (2014).
  "OOPS! (OntOlogy Pitfall Scanner!): An On-line Tool for Ontology
  Evaluation." *IJSWIS* 10(2), 7-34.
  → Pitfall catalogue: 41 anti-patterns for ontology design.

- Tartir, S. & Arpinar, I.B. (2007). "Ontology Evaluation and Ranking
  using OntoQA." *ICSC 2007*, IEEE, 185-192.
  → Schema metrics: relationship richness, attribute richness.

- Wille, R. (1982). "Restructuring Lattice Theory." In *Ordered Sets*,
  Reidel, 445-470.
  → Formal Concept Analysis: concept lattices from binary contexts.

- Elhaj, H. et al. (2008). "Auditing SNOMED CT with Formal Concept
  Analysis." *AMIA Annual Symposium*, PMC2605587.
  → FCA for ontology completeness auditing.

- Keet, C.M. (2008). *A Formal Theory of Granularity.* PhD thesis,
  Free University of Bozen-Bolzano.
  → Granularity levels and perspectives for ontology design.

- Mernik, M., Heering, J. & Sloane, A.M. (2005). "When and How to
  Develop Domain-Specific Languages." *ACM Computing Surveys* 37(4),
  316-344.
  → DSL design: soundness, completeness, laconicity.

- Karsai, G. et al. (2014). "Design Guidelines for Domain Specific
  Languages." *arXiv:1409.2378*.
  → Orthogonality, necessary-and-sufficient principle.

- Xue, B. & Zou, L. (2022). "Knowledge Graph Quality Management: A
  Comprehensive Survey." *IEEE TKDE* 35(5), 4969-4988.
  → KG quality dimensions: conciseness, consistency, completeness.