Collection-Level Metrics Methodology
How we evaluate the quality of the infospace as a collection of interrelated concepts, beyond the quality of individual entities.
This document describes the theoretical frameworks we draw on (ontology engineering, formal concept analysis, semiotic quality theory, and DSL design) and how each is adapted to work within MarkiTect's two-layer evaluation model (LLM-Eval + deterministic aggregation).
1. The Two-Layer Model
Every metric in this methodology decomposes into two layers:
| Layer | What it does | How it runs |
|---|---|---|
| LLM-Eval | Qualitative judgment: "Are these two concepts the same?", "Is this definition grounded in the source?" | Prompt template → LLM → structured YAML output |
| Deterministic | Quantitative aggregation: cosine similarity, graph connectivity, coverage counting, cycle detection | Python code in process_chapters.py or dedicated metrics.py |
The LLM-Eval layer produces per-entity or per-pair structured scores. The deterministic layer aggregates these into collection-level metrics, persisted as machine-readable YAML alongside human-readable markdown reports.
Per-concept quality metrics (definition precision, source grounding, VSM relevance — see INFRA-TASKS 8-12) operate at the individual entity level. This document covers the five collection-level concerns that assess how the entities work together as an explanatory system.
2. Five Collection-Level Concerns
Overview
| # | Concern | Question | Primary framework |
|---|---|---|---|
| C1 | Semantic Overlap | Are there redundant concepts? | OOPS! P2, embedding similarity |
| C2 | Coverage Completeness | Does the concept set cover the domain? | SEQUAL, FCA |
| C3 | Structural Coherence | Do concepts form a connected explanatory graph? | OntoQA, graph theory |
| C4 | Definitional Consistency | Are concepts defined consistently and non-circularly? | OntoClean, OOPS! P24 |
| C5 | Granularity Balance | Are concepts at comparable levels of abstraction? | Granularity theory, DSL laconicity |
3. Theoretical Frameworks
3.1 SEQUAL (Semiotic Quality Framework)
Origin: Lindland, Sindre & Sølvberg (1994), extended by Krogstie et al.
What it defines: Quality of a conceptual model as the correspondence between three worlds — the domain (what exists), the model (what we captured), and the audience's interpretation (what they understand).
Two key dimensions of semantic quality:
- Validity — everything in the model corresponds to something real in the domain. No invented concepts.
- Completeness — everything relevant in the domain is represented in the model. No missing concepts.
How we use it: SEQUAL frames our entire metrics approach. Every collection-level metric maps to one of these dimensions:
| SEQUAL dimension | Our concerns |
|---|---|
| Validity | C1 (redundancy reduces validity — duplicate concepts don't correspond to distinct domain facts), C4 (consistency — contradictory definitions can't both be valid) |
| Completeness | C2 (coverage — are all needed concepts present?), C5 (granularity — missing levels of abstraction are completeness gaps) |
| Both | C3 (coherence — disconnected concepts suggest either missing bridging concepts [completeness] or misplaced concepts [validity]) |
Adaptation: SEQUAL was designed for formal models evaluated by human experts. We replace human judgment with LLM-Eval (for validity checks like "does this concept correspond to something Smith actually described?") and deterministic counting (for completeness checks like "which VSM systems lack entity mappings?").
3.2 OntoClean
Origin: Guarino & Welty (2004).
What it defines: A methodology for validating taxonomic relationships by assigning metaproperties to each concept:
- Rigidity — Is the property essential to all its instances? (e.g. "market" is rigid; "effectual demander" is anti-rigid — an agent can stop being an effectual demander)
- Identity — Does the concept carry an identity criterion? (e.g. "division of labour" can be identified by its three causal mechanisms)
- Unity — Are all instances of this concept whole in the same way?
- Dependence — Does the concept require another concept to exist? (e.g. "market price" depends on "effectual demand")
Constraint: A rigid concept cannot be subsumed by an anti-rigid one. Violations indicate structural confusion.
How we use it: We do not have a formal taxonomy, but our flat entity set implicitly contains subsumption relationships (e.g. "natural rate" subsumes "ordinary-or-average rate"). OntoClean metaproperties help detect:
- Granularity mismatches (C5): A rigid concept at the same level as an anti-rigid one suggests different abstraction levels are mixed.
- Definitional consistency (C4): If entity A depends on entity B per OntoClean, but B's definition doesn't acknowledge A, the definitions are inconsistent.
- Redundancy (C1): Two entities with identical metaproperty profiles and overlapping definitions are candidates for merging.
Adaptation: Instead of manual metaproperty assignment, we use LLM-Eval to classify each entity's rigidity, identity criterion, and dependencies. The constraint checking is then deterministic.
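The deterministic constraint check described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the rigidity labels, the subsumption pairs, and the function name `rigidity_violations` are all hypothetical stand-ins for whatever the LLM-Eval step actually emits.

```python
from typing import Dict, List, Tuple

# Hypothetical LLM-Eval output: one rigidity label per entity slug.
RIGIDITY: Dict[str, str] = {
    "agent": "rigid",
    "market": "rigid",
    "effectual-demander": "anti-rigid",
}

# Hypothetical subsumption pairs (child, parent): child "is a kind of" parent.
SUBSUMPTIONS: List[Tuple[str, str]] = [
    ("effectual-demander", "agent"),  # anti-rigid under rigid: allowed
    ("agent", "effectual-demander"),  # rigid under anti-rigid: violation
]


def rigidity_violations(
    rigidity: Dict[str, str], subsumptions: List[Tuple[str, str]]
) -> List[Tuple[str, str]]:
    """Return (child, parent) pairs where a rigid child sits under an anti-rigid parent."""
    return [
        (child, parent)
        for child, parent in subsumptions
        if rigidity.get(child) == "rigid" and rigidity.get(parent) == "anti-rigid"
    ]


violations = rigidity_violations(RIGIDITY, SUBSUMPTIONS)
print(violations)
```

Entities without a label are skipped rather than flagged, on the assumption that an unclassified entity should be re-sent to LLM-Eval instead of being counted as a violation.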
3.3 OOPS! (Ontology Pitfall Scanner)
Origin: Poveda-Villalón et al. (2014). Catalogue of 41 common ontology design pitfalls.
What it defines: Concrete, testable anti-patterns. The pitfalls most relevant to our infospace:
| Pitfall | Description | Our concern |
|---|---|---|
| P2 | Synonymous classes — different names, same meaning | C1 (redundancy) |
| P4 | Unconnected ontology elements | C3 (coherence) |
| P6 | Missing inverse relationships | C3 |
| P7 | Merging different concepts in the same class | C5 (granularity — too coarse) |
| P11 | Missing domain or range | C4 (consistency) |
| P19 | Missing disjointness axioms | C1 (how do we know two concepts don't overlap?) |
| P24 | Recursive/circular definition | C4 (consistency) |
| P25 | Inverse of itself | C4 |
How we use it: OOPS! pitfalls become a checklist for LLM-Eval prompts. Rather than running a formal OWL scanner, we ask the LLM to check for each pitfall pattern:
- "Are entities A and B synonymous?" (P2)
- "Does entity A's definition reference itself?" (P24)
- "Is entity A actually two distinct concepts merged together?" (P7)
The deterministic layer counts pitfall occurrences and tracks them over time.
Adaptation: We select the subset of OOPS! pitfalls applicable to semi-formal markdown-based ontologies (no OWL axioms) and implement each as an LLM-Eval prompt pattern rather than a formal reasoner check.
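Counting pitfall occurrences and tracking them over time might look like the sketch below. The record layout (`pitfall`, `entities`), the run labels, and both function names are assumptions made for illustration; the toy findings are not real evaluation results.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical LLM-Eval findings: one record per detected pitfall (toy data).
findings: List[Dict] = [
    {"pitfall": "P2", "entities": ["entity-a", "entity-b"]},
    {"pitfall": "P24", "entities": ["entity-c"]},
    {"pitfall": "P2", "entities": ["entity-d", "entity-e"]},
]


def count_pitfalls(records: List[Dict]) -> Counter:
    """Count how often each OOPS! pitfall code was reported."""
    return Counter(record["pitfall"] for record in records)


def append_snapshot(history: List[Dict], run_label: str, records: List[Dict]) -> List[Dict]:
    """Append one run's pitfall counts to an in-memory history list."""
    history.append({"run": run_label, "counts": dict(count_pitfalls(records))})
    return history


history = append_snapshot([], "run-1", findings)
print(history)
```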
3.4 OntoQA (Metric-Based Ontology Quality Analysis)
Origin: Tartir & Arpinar (2007).
What it defines: Quantitative schema-level and instance-level metrics:
- Relationship Richness (RR): Proportion of non-taxonomic (lateral) relationships to total relationships. RR = non_hierarchical / total. Low RR = mere taxonomy. High RR = rich cross-cutting connections.
- Attribute Richness (AR): Average number of attributes per concept. AR = total_attributes / total_concepts.
- Inheritance Richness (IR): Average subclasses per class; measures how knowledge distributes across the hierarchy.
- Class Richness (CR): Proportion of classes with instances.
How we use it: Our entities don't have formal relationships declared between them, but we can infer a relationship graph from their definitions and mappings:
- Entity A references entity B in its definition → definitional dependency
- Entities A and B map to the same VSM system → structural co-occurrence
- Entities A and B appear in the same chapter → contextual co-occurrence
From this inferred graph, we compute OntoQA metrics directly:
- Relationship Richness tells us whether our concepts form a web of explanatory connections or just a flat list.
- Attribute Richness maps to our schema sections — entities with more optional sections filled (Original Wording, Modern Interpretation) are richer.
Adaptation: The key modification is that relationship inference is an LLM-Eval step (pairwise: "does A's definition depend on or reference B?"), after which all OntoQA metrics are computed deterministically on the resulting graph.
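Once the edges are inferred, the two OntoQA ratios used here reduce to simple counting. The sketch below assumes a hypothetical edge format `(source, target, kind)` and treats only the `subsumes` kind as hierarchical; the edge kinds, the section names per entity, and the function names are illustrative assumptions.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str, str]  # (source, target, kind); hypothetical format

HIERARCHICAL: Set[str] = {"subsumes"}  # assumption: every other kind is lateral

edges: List[Edge] = [
    ("natural-rate", "ordinary-or-average-rate", "subsumes"),
    ("market-price", "effectual-demand", "depends_on"),
    ("market-price", "natural-price", "co_occurs"),
    ("division-of-labour", "market-extent", "depends_on"),
]

# Hypothetical index of filled schema sections per entity.
sections: Dict[str, List[str]] = {
    "market-price": ["Definition", "Original Wording"],
    "natural-price": ["Definition"],
}


def relationship_richness(graph_edges: List[Edge]) -> float:
    """RR = non_hierarchical / total (0.0 for an empty graph)."""
    if not graph_edges:
        return 0.0
    lateral = sum(1 for _, _, kind in graph_edges if kind not in HIERARCHICAL)
    return lateral / len(graph_edges)


def attribute_richness(filled: Dict[str, List[str]]) -> float:
    """AR = total_attributes / total_concepts, counting filled sections as attributes."""
    if not filled:
        return 0.0
    return sum(len(names) for names in filled.values()) / len(filled)


rr = relationship_richness(edges)
ar = attribute_richness(sections)
print(rr, ar)
```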
3.5 Formal Concept Analysis (FCA)
Origin: Wille (1982). Applied to ontology auditing by Elhaj et al. (2008) for SNOMED CT completeness checking.
What it defines: A mathematical framework for deriving a concept lattice from a binary relation between objects and attributes. The lattice reveals:
- Formal concepts: maximal sets of objects sharing the same attributes
- Subconcept/superconcept relationships: the natural hierarchy
- Missing concepts: attribute combinations with no corresponding object
How we use it: We construct a formal context (binary matrix):
- Objects = our 85 entities
- Attributes = economic domain, VSM system, source book, abstraction level (from LLM-Eval), key terms (extracted from definitions)
The concept lattice then reveals:
- Coverage gaps (C2): Attribute combinations with no entity. E.g. if the cell {Distribution, S3} is empty, we lack control-layer concepts for distribution — a specific, actionable gap.
- Redundancy (C1): Entities with identical attribute sets (same formal concept) are candidates for merging.
- Granularity (C5): The lattice depth indicates how many meaningful levels of abstraction exist. A shallow lattice suggests missing intermediate concepts.
Adaptation: Classic FCA requires crisp binary attributes. Our domains
and VSM mappings are already categorical, but abstraction level and key
terms need LLM-Eval to produce. The lattice computation itself is
deterministic (Python concepts library or equivalent). The FCA approach
replaces the current "ask the LLM about coverage" with a structural
computation that can identify specific gaps rather than vague
recommendations.
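To make the gap-finding idea concrete, here is a small sketch of a formal context with the two FCA derivation operators and an empty-cell scan over two attribute families. It does not build the full concept lattice; the toy entities, the `domain:` / `vsm:` attribute prefixes, and the function names are assumptions for illustration only.

```python
from itertools import product
from typing import Dict, List, Set, Tuple

# Toy formal context: object -> set of attributes (hypothetical data).
context: Dict[str, Set[str]] = {
    "division-of-labour": {"domain:Production", "vsm:S1"},
    "market-price": {"domain:Exchange", "vsm:S2"},
    "wages-of-labour": {"domain:Distribution", "vsm:S1"},
}


def extent(attrs: Set[str], ctx: Dict[str, Set[str]]) -> Set[str]:
    """Objects that carry every attribute in attrs."""
    return {obj for obj, have in ctx.items() if attrs <= have}


def intent(objs: Set[str], ctx: Dict[str, Set[str]]) -> Set[str]:
    """Attributes shared by every object in objs."""
    attr_sets = [ctx[obj] for obj in objs]
    return set.intersection(*attr_sets) if attr_sets else set()


def empty_cells(ctx: Dict[str, Set[str]], prefix_a: str, prefix_b: str) -> List[Tuple[str, str]]:
    """Attribute pairs (one per family) that no object carries together."""
    all_attrs: Set[str] = set().union(*ctx.values())
    a_values = sorted(a for a in all_attrs if a.startswith(prefix_a))
    b_values = sorted(a for a in all_attrs if a.startswith(prefix_b))
    return [(a, b) for a, b in product(a_values, b_values) if not extent({a, b}, ctx)]


gaps = empty_cells(context, "domain:", "vsm:")
shared = intent(extent({"vsm:S1"}, context), context)
print(gaps)
print(shared)
```

A pair `(extent, intent)` is a formal concept when applying the two operators in turn returns the same sets, which is what the last two lines check for the attribute `vsm:S1`.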
3.6 DSL Design Principles
Origin: Mernik et al. (2005) "When and How to Develop DSLs"; Karsai et al. (2014) "Design Guidelines for Domain-Specific Languages".
What they define: Quality criteria for a set of concepts that form a language for a specific domain:
- Soundness: Every concept in the language corresponds to a real domain concern (no invented abstractions).
- Completeness: The language can express everything needed for its intended tasks.
- Laconicity: No unnecessary concepts — every concept earns its place.
- Orthogonality: Concepts are independent; combining any two produces a meaningful result (no redundant combinations).
How we use it: Our entity set is effectively a domain-specific vocabulary for "explaining classical economics through VSM". DSL quality criteria translate directly:
- Soundness → Validity (SEQUAL): every entity grounded in Smith's text
- Completeness → Coverage (C2): can we answer the "competency questions" the infospace is meant to address?
- Laconicity → Anti-redundancy (C1) + Indispensability (C5): would removing any entity lose explanatory power?
- Orthogonality → Non-overlap (C1): entity definitions don't substantially duplicate each other
Adaptation: We operationalise DSL completeness through competency questions — a set of canonical questions the infospace should be able to answer (e.g. "How does the division of labour relate to market extent?", "What mechanisms regulate wages toward their natural rate?"). LLM-Eval tests whether the current entity set suffices to answer each question. Unanswerable questions identify specific completeness gaps.
Laconicity is operationalised as indispensability scoring: for each entity, LLM-Eval rates whether removing it would lose explanatory power. Low-scoring entities are candidates for merging or retirement.
4. Integration: Metric Definitions by Concern
C1: Semantic Overlap / Redundancy
Goal: Identify entities that substantially overlap in meaning and should be merged, distinguished, or retired.
Metrics:
| Metric | Type | Computation |
|---|---|---|
| `similarity_matrix` | Deterministic | Embed all entity definitions; compute N×N cosine similarity |
| `high_similarity_pairs` | Deterministic | Pairs with cosine > 0.80, sorted descending |
| `confirmed_synonyms` | LLM-Eval | For each high-similarity pair, LLM judges: "same concept" / "genuinely distinct" / "partial overlap" |
| `redundancy_ratio` | Deterministic | confirmed_synonyms / total_entities |
| `intensional_conciseness` | Deterministic | 1 - redundancy_ratio (from KG quality framework) |
Pipeline:
- Embed definitions (embedding API or local model)
- Compute cosine similarity matrix
- Filter pairs above threshold
- LLM pairwise judgment on filtered pairs only (avoids N² LLM calls)
- Aggregate into ratio and conciseness score
Output: output/metrics/redundancy-report.md + structured YAML with
pair list, scores, and merge/retire recommendations.
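The deterministic steps of this pipeline (similarity, threshold filter, ratio) can be sketched in plain Python. The three-dimensional vectors below are toy stand-ins for real embeddings, and the function names are hypothetical; only the 0.80 threshold and the two formulas come from the table above.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def high_similarity_pairs(
    embeddings: Dict[str, Sequence[float]], threshold: float = 0.80
) -> List[Tuple[str, str, float]]:
    """Pairs with cosine above the threshold, sorted by descending similarity."""
    pairs = []
    for a, b in combinations(sorted(embeddings), 2):
        score = cosine(embeddings[a], embeddings[b])
        if score > threshold:
            pairs.append((a, b, score))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


def redundancy_ratio(n_confirmed_synonyms: int, n_entities: int) -> float:
    """confirmed_synonyms / total_entities."""
    return n_confirmed_synonyms / n_entities if n_entities else 0.0


# Toy embeddings (hypothetical); a real run would use an embedding model.
toy = {
    "natural-price": [1.0, 0.0, 0.0],
    "central-price": [0.9, 0.1, 0.0],
    "wages-of-labour": [0.0, 1.0, 0.0],
}
candidates = high_similarity_pairs(toy)
ratio = redundancy_ratio(n_confirmed_synonyms=1, n_entities=len(toy))
conciseness = 1 - ratio
print(candidates, ratio, conciseness)
```

Only the pairs in `candidates` would be passed on to the LLM judgment step, which is what keeps the number of LLM calls well below N².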
C2: Coverage Completeness
Goal: Identify domain areas that are structurally sparse or isolated within the corpus — and separately, assess whether the entity set can answer the infospace's declared competency questions.
What the deterministic check actually computes
The current implementation builds a binary domain × chapter cross-table: one row per economic domain, one column per source chapter. A cell is populated if at least one entity has that (domain, chapter) combination.
coverage_ratio = populated_cells / (n_domains × n_chapters)
This is not the same as VSM coverage. The domain × VSM matrix described
in earlier versions of this document requires VSM system mappings to be
supplied as extra_attributes to check_coverage(). The pipeline does not
currently do this, so coverage_ratio reflects cross-chapter domain
distribution, not VSM system coverage.
Important: interpret the distribution, not just the ratio
The aggregate ratio conflates two structurally different situations:
| Situation | coverage_ratio | What it means |
|---|---|---|
| Healthy topic separation | Low | Domains are locally dense within their book/section — expected for a multi-topic corpus |
| Fragmented extraction | Low | Domains appear sporadically everywhere, never anchored |
Both produce the same ratio. Use the per-domain density distribution to distinguish them:
| Metric | Meaning |
|---|---|
| `domain_densities` | Per-domain fraction of chapters containing ≥1 entity with that domain |
| `density_std` | Standard deviation of densities. High std → healthy topic separation (bimodal: some domains cross-cutting, others local). Low std → uniform but thin. |
| `cross_cutting_ratio` | Fraction of domains appearing in more than 50% of chapters (the foundational, cross-cutting concepts). |
Example interpretation for the WoN/VSM infospace (1021 entities, 35 chapters):
Exchange 0.848 ████████████████ cross-cutting
Regulation 0.848 ████████████████ cross-cutting
General Theory 0.727 ██████████████ cross-cutting
Production 0.636 ████████████ cross-cutting
Distribution 0.576 ███████████ borderline
Accumulation 0.364 ███████ book-specific
Consumption 0.333 ██████ book-specific
density_std = 0.33 (high → healthy topic separation)
cross_cutting_ratio = 0.50
coverage_ratio = 0.44 (below 0.50 threshold, but for correct reasons)
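The four quantities above follow directly from the set of populated (domain, chapter) cells. The sketch below is not the body of `check_coverage`; it is a standalone illustration with a hypothetical function name and toy input, and it assumes the population standard deviation, which may differ from what the implementation uses.

```python
from statistics import pstdev
from typing import Dict, Iterable, List, Tuple


def coverage_metrics(
    pairs: Iterable[Tuple[str, int]], domains: List[str], chapters: List[int]
) -> Dict[str, object]:
    """Compute coverage_ratio, per-domain densities, their std dev and cross-cutting ratio."""
    populated = {(d, c) for d, c in pairs if d in domains and c in chapters}
    densities = {
        d: sum(1 for c in chapters if (d, c) in populated) / len(chapters)
        for d in domains
    }
    return {
        "coverage_ratio": len(populated) / (len(domains) * len(chapters)),
        "domain_densities": densities,
        # Assumption: population std dev; the real code may use the sample version.
        "density_std": pstdev(list(densities.values())),
        "cross_cutting_ratio": sum(1 for v in densities.values() if v > 0.5) / len(domains),
        "empty_cells": sorted(
            (d, c) for d in domains for c in chapters if (d, c) not in populated
        ),
    }


# Toy input: one (domain, chapter) pair per entity; duplicates are harmless.
toy_pairs = [
    ("Exchange", 1), ("Exchange", 1), ("Exchange", 2), ("Exchange", 3),
    ("Accumulation", 4),
]
report = coverage_metrics(toy_pairs, ["Exchange", "Accumulation"], [1, 2, 3, 4])
print(report)
```

In the toy input, one domain spans three of four chapters and the other is confined to a single chapter, so the aggregate ratio sits exactly at 0.50 while the densities show the split that the ratio hides.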
What coverage does NOT capture
- Entity-to-entity connections — whether concepts reference each other, form explanatory chains, or cluster coherently. That is C3 (Structural Coherence).
- VSM competency question answerability — whether current entities collectively support answering the declared competency questions. That requires LLM-Eval and is a planned metric (see below).
- Whether absent (domain, chapter) cells are meaningful gaps or expected absences — the ratio treats them identically.
Threshold guidance
- `min: 0.50` is appropriate for a focused, single-topic corpus where all chapters address the same set of domains.
- For heterogeneous multi-book corpora, domains introduced late create empty cells for all earlier chapters. A threshold of `0.30–0.40` is more realistic.
- Prefer `cross_cutting_ratio` and `density_std` as the primary diagnostic signals; use `coverage_ratio` only for trend tracking across snapshots.
Metrics:
| Metric | Type | Computation | Status |
|---|---|---|---|
| `coverage_ratio` | Deterministic | `populated_cells / (n_domains × n_chapters)` | ✅ Implemented |
| `domain_densities` | Deterministic | Per-domain fraction of chapters with ≥1 entity | ✅ Implemented |
| `density_std` | Deterministic | Std dev of domain densities | ✅ Implemented |
| `cross_cutting_ratio` | Deterministic | Fraction of domains with density > 0.5 | ✅ Implemented |
| `empty_cells` | Deterministic | List of unpopulated (domain, chapter) pairs | ✅ Implemented |
| `fca_gap_concepts` | Deterministic | Attribute combos in FCA lattice with no entity | ✅ Implemented |
| `domain_vsm_matrix` | Deterministic | Entities per {domain, VSM_system} cell; requires VSM mappings in `extra_attributes` | ⬜ Not yet wired |
| `competency_coverage` | LLM-Eval | For each competency question, can it be answered? | ⬜ Not yet implemented |
Pipeline (current):
- Parse entity metadata (domain, source chapter) from entity files
- Build domain × chapter binary matrix; identify empty cells
- Compute per-domain densities, std dev, cross-cutting ratio
- Build FCA formal context; extract gap concepts
- Aggregate into `CoverageReport`
Output: Snapshot recorded in output/metrics/history.yaml. A
coverage-report.md per chapter is planned but not yet generated.
C3: Structural Coherence
Goal: Determine whether the entities form a connected explanatory web or a fragmented collection of isolated concepts.
Metrics:
| Metric | Type | Computation |
|---|---|---|
| `relationship_graph` | LLM-Eval + Deterministic | Infer edges from definition cross-references (string matching) + LLM judgment for implicit references |
| `connected_components` | Deterministic | Number of connected components in the graph (target: 1) |
| `graph_density` | Deterministic | actual_edges / possible_edges |
| `avg_degree` | Deterministic | total_edges / total_entities |
| `relationship_richness` | Deterministic | OntoQA RR: non_hierarchical_edges / total_edges |
| `modularity` | Deterministic | Louvain modularity score (0.3-0.7 = meaningful structure; >0.8 = fragmentation) |
| `bridge_concepts` | Deterministic | Entities with highest betweenness centrality (connect clusters) |
| `orphan_entities` | Deterministic | Entities with degree 0 or 1 |
| `cohesion_by_domain` | Deterministic | Avg intra-domain edges per entity |
| `coupling_across_domains` | Deterministic | Inter-domain edges / total edges |
Pipeline:
- Extract explicit cross-references from definitions (entity name mentions in other definitions — string matching with slug normalisation)
- For entity pairs not caught by string matching, LLM-Eval: "Does A's definition depend on or reference B's concept?"
- Build directed graph
- Compute graph metrics (networkx or equivalent)
- Run community detection; compare detected communities to declared economic domains
Output: output/metrics/coherence-report.md + YAML with graph
statistics, orphan list, bridge concepts, and community structure.
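A few of the graph metrics above can be computed with the standard library alone, as sketched below; a graph library such as networkx would be the natural choice for the rest (modularity, betweenness). The function name and toy edges are hypothetical. The sketch assumes directed edges, so possible_edges is taken as n × (n − 1), and it measures components and orphan degree on the undirected view.

```python
from collections import defaultdict, deque
from typing import Dict, List, Set, Tuple


def coherence_metrics(nodes: List[str], edges: List[Tuple[str, str]]) -> Dict[str, object]:
    """Components, density, average degree and orphans for a small entity graph."""
    neighbours: Dict[str, Set[str]] = defaultdict(set)
    for src, dst in edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)  # undirected view for connectivity

    seen: Set[str] = set()
    components = 0
    for node in nodes:
        if node in seen:
            continue
        components += 1  # each unseen node starts a new component
        seen.add(node)
        queue = deque([node])
        while queue:  # breadth-first search over the component
            current = queue.popleft()
            for nb in neighbours[current]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)

    n = len(nodes)
    possible = n * (n - 1)  # assumption: directed graph without self-loops
    return {
        "connected_components": components,
        "graph_density": len(edges) / possible if possible else 0.0,
        "avg_degree": len(edges) / n if n else 0.0,
        "orphan_entities": sorted(x for x in nodes if len(neighbours[x]) <= 1),
    }


toy_nodes = ["division-of-labour", "market-extent", "market-price", "rent-of-land"]
toy_edges = [("division-of-labour", "market-extent"), ("market-extent", "market-price")]
metrics = coherence_metrics(toy_nodes, toy_edges)
print(metrics)
```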
C4: Definitional Consistency
Goal: Ensure entities are defined consistently, non-circularly, and without contradicting each other.
Metrics:
| Metric | Type | Computation |
|---|---|---|
| `definitional_dependency_graph` | Deterministic + LLM-Eval | Edges where A's definition uses B's concept |
| `circular_definitions` | Deterministic | Cycles of length ≤ 3 in the dependency graph |
| `definition_depth` | Deterministic | Longest dependency chain per entity before reaching a term not in the entity set |
| `undefined_dependencies` | Deterministic | Terms used in definitions that arguably should be entities but aren't |
| `pairwise_consistency` | LLM-Eval | For related entity pairs (sharing edges): "Do these definitions contradict each other?" |
| `source_fidelity` | LLM-Eval | "Does this definition accurately represent what Smith wrote in the cited passage?" |
| `metaproperty_violations` | LLM-Eval + Deterministic | OntoClean constraint checking after LLM classifies rigidity/identity |
| `grounding_ratio` | Deterministic | Fraction of entities traceable to primitives without cycles |
Pipeline:
- Build definitional dependency graph (same technique as C3, but directed — A depends on B means A's definition uses B, not vice versa)
- Detect cycles; flag short cycles
- Extract undefined terms (terms matching entity-name patterns that appear in definitions but have no corresponding entity file)
- LLM pairwise consistency check on directly-connected pairs
- LLM source fidelity check (compare definition to source chapter text)
- LLM OntoClean metaproperty classification; deterministic constraint checking
Output: output/metrics/consistency-report.md + YAML with cycle list,
undefined terms, contradiction candidates, and metaproperty violations.
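Detecting short cycles in the dependency graph is a bounded depth-first search. The sketch below finds cycles of length ≤ 3, including self-references, and reports each cycle once by rotating it to start at its smallest slug. The function name and the toy dependency map are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def short_cycles(deps: Dict[str, List[str]], max_len: int = 3) -> List[Tuple[str, ...]]:
    """Return each directed cycle of length <= max_len exactly once."""
    found: Set[Tuple[str, ...]] = set()

    def walk(start: str, node: str, path: List[str]) -> None:
        for nxt in deps.get(node, ()):
            if nxt == start:
                # Rotate so the smallest slug comes first: one canonical form per cycle.
                i = path.index(min(path))
                found.add(tuple(path[i:] + path[:i]))
            elif nxt not in path and len(path) < max_len:
                walk(start, nxt, path + [nxt])

    for start in deps:
        walk(start, start, [start])
    return sorted(found)


# Toy map: entity -> entities its definition uses (hypothetical data).
toy_deps = {
    "value": ["price"],
    "price": ["value"],        # two-step circular definition
    "wages": ["labour"],
    "labour": [],
    "stock": ["stock"],        # self-reference (OOPS! P24)
}
cycles = short_cycles(toy_deps)
print(cycles)
```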
C5: Granularity Balance
Goal: Ensure entities operate at comparable levels of abstraction within their respective domains and perspectives.
Metrics:
| Metric | Type | Computation |
|---|---|---|
| `abstraction_classification` | LLM-Eval | Classify each entity as theory-level / mechanism-level / observation-level |
| `scope_score` | LLM-Eval | Rate each entity 1-5 for generality (1 = very specific instance, 5 = broad theoretical principle) |
| `abstraction_distribution` | Deterministic | Count per level; compute entropy |
| `scope_variance` | Deterministic | Variance of scope scores within each domain |
| `level_x_perspective_matrix` | Deterministic | Cross-tabulation of abstraction level × economic domain |
| `indispensability` | LLM-Eval | "If removed, what explanatory power is lost?" (1-5) |
| `dispensable_entities` | Deterministic | Entities with indispensability score ≤ 2 |
| `merge_candidates` | LLM-Eval | Pairs where one is a sub-case of the other |
Pipeline:
- LLM-classify each entity: abstraction level, scope score, indispensability
- Build level × perspective matrix
- Compute distribution entropy and per-domain scope variance
- Flag outliers: entities whose scope score deviates > 1.5σ from their domain mean
- For outlier entities, LLM-Eval: "Should this be merged into a broader concept, or split into sub-concepts?"
Output: output/metrics/granularity-report.md + YAML with
classifications, distribution, outliers, and merge/split recommendations.
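The entropy and outlier steps of this pipeline are sketched below under stated assumptions: entropy is Shannon entropy in bits over the abstraction levels, and σ is the population standard deviation of scope scores within a domain. Function names and the toy scores are hypothetical.

```python
import math
from collections import Counter, defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def level_entropy(levels: List[str]) -> float:
    """Shannon entropy (bits) of the abstraction-level distribution."""
    total = len(levels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(levels).values()
    )


def scope_outliers(scores: Dict[str, Tuple[str, int]], k: float = 1.5) -> List[str]:
    """Entities whose scope score deviates more than k sigma from their domain mean."""
    by_domain: Dict[str, List[int]] = defaultdict(list)
    for domain, score in scores.values():
        by_domain[domain].append(score)

    outliers = []
    for slug, (domain, score) in scores.items():
        values = by_domain[domain]
        sigma = pstdev(values)  # assumption: population std dev
        if sigma > 0 and abs(score - mean(values)) > k * sigma:
            outliers.append(slug)
    return sorted(outliers)


# Toy LLM-Eval output: slug -> (domain, scope score 1-5); hypothetical data.
toy_scores = {
    "entity-a": ("Exchange", 3),
    "entity-b": ("Exchange", 3),
    "entity-c": ("Exchange", 3),
    "entity-d": ("Exchange", 3),
    "entity-e": ("Exchange", 1),
}
entropy = level_entropy(["theory", "mechanism", "mechanism", "observation"])
flagged = scope_outliers(toy_scores)
print(entropy, flagged)
```

With a population standard deviation, the largest possible deviation in a domain of n entities is √(n − 1) σ, so very small domains can never produce a 1.5σ outlier; that is a property of the threshold worth keeping in mind when reading the outlier list.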
5. Shared Infrastructure
Several concerns share underlying computations:
| Infrastructure | Used by | Build once |
|---|---|---|
| Definition embeddings (vector per entity) | C1, C3 | Embedding API call per entity |
| Relationship graph (entity → entity edges) | C3, C4 | String matching + LLM-Eval |
| FCA formal context (entity × attribute matrix) | C2, C5 | Metadata parsing + LLM classification |
| Entity metadata index (domain, VSM, chapter, sections) | C2, C5, C10 (schema compliance) | Deterministic markdown parsing |
These should be computed once per evaluation run and cached for use by all concern-specific metrics.
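One simple way to get compute-once behaviour for embeddings is to key the cache on a hash of the definition text, so unchanged definitions never trigger a second call. The class below is a hypothetical in-memory sketch; `fake_embed` stands in for a real embedding call, and persistence to disk is left out.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    """In-memory cache keyed by the SHA-256 of the definition text."""

    def __init__(self, embed_fn: Callable[[str], List[float]]) -> None:
        self._embed_fn = embed_fn
        self._store: Dict[str, List[float]] = {}
        self.calls = 0  # number of times the underlying embed function ran

    def get(self, definition: str) -> List[float]:
        key = hashlib.sha256(definition.encode("utf-8")).hexdigest()
        if key not in self._store:
            self._store[key] = self._embed_fn(definition)
            self.calls += 1
        return self._store[key]


def fake_embed(text: str) -> List[float]:
    """Placeholder for a real embedding call (hypothetical)."""
    return [float(len(text))]


cache = EmbeddingCache(fake_embed)
first = cache.get("The price at which a commodity is commonly sold.")
second = cache.get("The price at which a commodity is commonly sold.")
third = cache.get("A changed definition.")
print(cache.calls)
```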
6. Evaluation Workflow
A full collection-level evaluation run:
process_chapters.py --evaluate-collection --provider <provider>
- Parse: deterministic metadata extraction from all entity files
- Embed: compute definition embeddings (cached; only new/changed entities need fresh embeddings)
- Infer: LLM-Eval for relationship edges, metaproperties, abstraction levels, pairwise judgments (batched to minimise LLM calls)
- Compute: deterministic graph metrics, FCA lattice, coverage matrix, similarity matrix, cycle detection
- Aggregate: combine per-entity and per-pair scores into collection-level metrics
- Report: write per-concern markdown reports + unified `metrics.yaml`
- Append: add timestamped snapshot to `metrics-history.yaml`
Incremental mode (--evaluate-collection --chapter <id>) re-evaluates
only the entities introduced or modified by that chapter, plus any
pairwise checks involving those entities.
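Selecting the pairwise checks for an incremental run amounts to keeping every pair that touches at least one changed entity. A minimal sketch, with a hypothetical function name and toy slugs:

```python
from itertools import combinations
from typing import Iterable, List, Set, Tuple


def pairs_to_recheck(all_entities: Iterable[str], changed: Set[str]) -> List[Tuple[str, str]]:
    """All unordered entity pairs in which at least one member was added or modified."""
    return [
        (a, b)
        for a, b in combinations(sorted(all_entities), 2)
        if a in changed or b in changed
    ]


recheck = pairs_to_recheck({"market-price", "natural-price", "wages-of-labour"}, {"wages-of-labour"})
print(recheck)
```

For N entities and k changed ones this yields k × (N − k) + k × (k − 1) / 2 pairs rather than N × (N − 1) / 2, which is where the saving of incremental mode comes from.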
7. References
- Lindland, O.I., Sindre, G. & Sølvberg, A. (1994). "Understanding Quality in Conceptual Modeling." IEEE Software 11(2), 42-49. → SEQUAL framework: validity and completeness dimensions.
- Guarino, N. & Welty, C.A. (2004). "An Overview of OntoClean." In Handbook on Ontologies, Springer, 151-171. → Metaproperty analysis: rigidity, identity, unity, dependence.
- Poveda-Villalón, M., Gómez-Pérez, A. & Suárez-Figueroa, M.C. (2014). "OOPS! (OntOlogy Pitfall Scanner!): An On-line Tool for Ontology Evaluation." IJSWIS 10(2), 7-34. → Pitfall catalogue: 41 anti-patterns for ontology design.
- Tartir, S. & Arpinar, I.B. (2007). "Ontology Evaluation and Ranking using OntoQA." ICSC 2007, IEEE, 185-192. → Schema metrics: relationship richness, attribute richness.
- Wille, R. (1982). "Restructuring Lattice Theory." In Ordered Sets, Reidel, 445-470. → Formal Concept Analysis: concept lattices from binary contexts.
- Elhaj, H. et al. (2008). "Auditing SNOMED CT with Formal Concept Analysis." AMIA Annual Symposium, PMC2605587. → FCA for ontology completeness auditing.
- Keet, C.M. (2008). A Formal Theory of Granularity. PhD thesis, Free University of Bozen-Bolzano. → Granularity levels and perspectives for ontology design.
- Mernik, M., Heering, J. & Sloane, A.M. (2005). "When and How to Develop Domain-Specific Languages." ACM Computing Surveys 37(4), 316-344. → DSL design: soundness, completeness, laconicity.
- Karsai, G. et al. (2014). "Design Guidelines for Domain Specific Languages." arXiv:1409.2378. → Orthogonality, necessary-and-sufficient principle.
- Xue, B. & Zou, L. (2022). "Knowledge Graph Quality Management: A Comprehensive Survey." IEEE TKDE 35(5), 4969-4988. → KG quality dimensions: conciseness, consistency, completeness.