
tegwick dfe56a4f9b docs(metrics): clarify C2 coverage — domain×chapter matrix, not domain×VSM

- coverage.py: rewrite module docstring to explain what the metric actually
  computes (domain × chapter cross-tabulation, not VSM system coverage),
  what it does not capture (entity connectivity → C3), and when the
  threshold is appropriate
- CoverageReport: add domain_densities, density_std, cross_cutting_ratio
  for distribution-level insight beyond the aggregate ratio
- check_coverage: compute per-domain density and cross-cutting ratio
- METRICS-METHODOLOGY.md: correct C2 section to match implementation,
  document the distribution-based interpretation, add implementation status
  table distinguishing what is wired vs planned

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-02-20 00:08:46 +01:00


Collection-Level Metrics Methodology

How we evaluate the quality of the infospace as a collection of interrelated concepts, beyond the quality of individual entities.

This document describes the theoretical frameworks, drawn from ontology engineering, formal concept analysis, semiotic quality theory, and DSL design, and how each is adapted to work within MarkiTect's two-layer evaluation model (LLM-Eval + deterministic aggregation).

1. The Two-Layer Model

Every metric in this methodology decomposes into two layers:

| Layer | What it does | How it runs |
|---|---|---|
| LLM-Eval | Qualitative judgment: "Are these two concepts the same?", "Is this definition grounded in the source?" | Prompt template → LLM → structured YAML output |
| Deterministic | Quantitative aggregation: cosine similarity, graph connectivity, coverage counting, cycle detection | Python code in `process_chapters.py` or dedicated `metrics.py` |

The LLM-Eval layer produces per-entity or per-pair structured scores. The deterministic layer aggregates these into collection-level metrics, persisted as machine-readable YAML alongside human-readable markdown reports.

Per-concept quality metrics (definition precision, source grounding, VSM relevance — see INFRA-TASKS 8-12) operate at the individual entity level. This document covers the five collection-level concerns that assess how the entities work together as an explanatory system.

2. Five Collection-Level Concerns

Overview

| # | Concern | Question | Primary framework |
|---|---|---|---|
| C1 | Semantic Overlap | Are there redundant concepts? | OOPS! P2, embedding similarity |
| C2 | Coverage Completeness | Does the concept set cover the domain? | SEQUAL, FCA |
| C3 | Structural Coherence | Do concepts form a connected explanatory graph? | OntoQA, graph theory |
| C4 | Definitional Consistency | Are concepts defined consistently and non-circularly? | OntoClean, OOPS! P24 |
| C5 | Granularity Balance | Are concepts at comparable levels of abstraction? | Granularity theory, DSL laconicity |

3. Theoretical Frameworks

3.1 SEQUAL (Semiotic Quality Framework)

Origin: Lindland, Sindre & Sølvberg (1994), extended by Krogstie et al.

What it defines: Quality of a conceptual model as the correspondence between three worlds — the domain (what exists), the model (what we captured), and the audience's interpretation (what they understand).

Two key dimensions of semantic quality:

Validity — everything in the model corresponds to something real in the domain. No invented concepts.
Completeness — everything relevant in the domain is represented in the model. No missing concepts.

How we use it: SEQUAL frames our entire metrics approach. Every collection-level metric maps to one of these dimensions:

| SEQUAL dimension | Our concerns |
|---|---|
| Validity | C1 (redundancy reduces validity: duplicate concepts don't correspond to distinct domain facts), C4 (consistency: contradictory definitions can't both be valid) |
| Completeness | C2 (coverage: are all needed concepts present?), C5 (granularity: missing levels of abstraction are completeness gaps) |
| Both | C3 (coherence: disconnected concepts suggest either missing bridging concepts [completeness] or misplaced concepts [validity]) |

Adaptation: SEQUAL was designed for formal models evaluated by human experts. We replace human judgment with LLM-Eval (for validity checks like "does this concept correspond to something Smith actually described?") and deterministic counting (for completeness checks like "which VSM systems lack entity mappings?").

3.2 OntoClean

Origin: Guarino & Welty (2004).

What it defines: A methodology for validating taxonomic relationships by assigning metaproperties to each concept:

Rigidity — Is the property essential to all its instances? (e.g. "market" is rigid; "effectual demander" is anti-rigid — an agent can stop being an effectual demander)
Identity — Does the concept carry an identity criterion? (e.g. "division of labour" can be identified by its three causal mechanisms)
Unity — Are all instances of this concept whole in the same way?
Dependence — Does the concept require another concept to exist? (e.g. "market price" depends on "effectual demand")

Constraint: A rigid concept cannot be subsumed by an anti-rigid one. Violations indicate structural confusion.

How we use it: We do not have a formal taxonomy, but our flat entity set implicitly contains subsumption relationships (e.g. "natural rate" subsumes "ordinary-or-average rate"). OntoClean metaproperties help detect:

Granularity mismatches (C5): A rigid concept at the same level as an anti-rigid one suggests different abstraction levels are mixed.
Definitional consistency (C4): If entity A depends on entity B per OntoClean, but B's definition doesn't acknowledge A, the definitions are inconsistent.
Redundancy (C1): Two entities with identical metaproperty profiles and overlapping definitions are candidates for merging.

Adaptation: Instead of manual metaproperty assignment, we use LLM-Eval to classify each entity's rigidity, identity criterion, and dependencies. The constraint checking is then deterministic.
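The deterministic half of this check can be sketched as follows. This is a minimal illustration, assuming the LLM-Eval step has already produced one rigidity label per entity and a list of inferred subsumption pairs. The function name `rigidity_violations`, the label strings, and the sample data are hypothetical and not taken from the codebase; the second subsumption pair is deliberately contrived to trigger the constraint.

```python
from typing import Dict, List, Tuple


def rigidity_violations(
    rigidity: Dict[str, str],
    subsumptions: List[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Return (parent, child) pairs where an anti-rigid parent subsumes a rigid child."""
    violations = []
    for parent, child in subsumptions:
        if rigidity.get(parent) == "anti-rigid" and rigidity.get(child) == "rigid":
            violations.append((parent, child))
    return violations


# Hypothetical labels, shaped as an LLM-Eval step might return them.
rigidity = {
    "market": "rigid",
    "effectual-demander": "anti-rigid",
    "natural-rate": "rigid",
    "ordinary-or-average-rate": "rigid",
}
# (parent, child) means "parent subsumes child".
subsumptions = [
    ("natural-rate", "ordinary-or-average-rate"),  # rigid over rigid: allowed
    ("effectual-demander", "market"),              # anti-rigid over rigid: violation
]

violations = rigidity_violations(rigidity, subsumptions)
print(violations)
```

Keeping the classification (LLM) and the constraint check (plain code) separate means the check can be re-run cheaply whenever labels are cached.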

3.3 OOPS! (Ontology Pitfall Scanner)

Origin: Poveda-Villalón et al. (2014). Catalogue of 41 common ontology design pitfalls.

What it defines: Concrete, testable anti-patterns. The pitfalls most relevant to our infospace:

| Pitfall | Description | Our concern |
|---|---|---|
| P2 | Synonymous classes: different names, same meaning | C1 (redundancy) |
| P4 | Unconnected ontology elements | C3 (coherence) |
| P7 | Merging different concepts in the same class | C5 (granularity: too coarse) |
| P10 | Missing disjointness axioms | C1 (how do we know two concepts don't overlap?) |
| P11 | Missing domain or range | C4 (consistency) |
| P13 | Missing inverse relationships | C3 |
| P24 | Recursive/circular definition | C4 (consistency) |
| P25 | Inverse of itself | C4 |

How we use it: OOPS! pitfalls become a checklist for LLM-Eval prompts. Rather than running a formal OWL scanner, we ask the LLM to check for each pitfall pattern:

"Are entities A and B synonymous?" (P2)
"Does entity A's definition reference itself?" (P24)
"Is entity A actually two distinct concepts merged together?" (P7)

The deterministic layer counts pitfall occurrences and tracks them over time.

Adaptation: We select the subset of OOPS! pitfalls applicable to semi-formal markdown-based ontologies (no OWL axioms) and implement each as an LLM-Eval prompt pattern rather than a formal reasoner check.

3.4 OntoQA (Metric-Based Ontology Quality Analysis)

Origin: Tartir & Arpinar (2007).

What it defines: Quantitative schema-level and instance-level metrics:

Relationship Richness (RR): Proportion of non-taxonomic (lateral) relationships to total relationships. RR = non_hierarchical / total. Low RR = mere taxonomy. High RR = rich cross-cutting connections.
Attribute Richness (AR): Average number of attributes per concept. AR = total_attributes / total_concepts.
Inheritance Richness (IR): Average subclasses per class — measures how knowledge distributes across the hierarchy.
Class Richness (CR): Proportion of classes with instances.

How we use it: Our entities don't have formal relationships declared between them, but we can infer a relationship graph from their definitions and mappings:

Entity A references entity B in its definition → definitional dependency
Entities A and B map to the same VSM system → structural co-occurrence
Entities A and B appear in the same chapter → contextual co-occurrence

From this inferred graph, we compute OntoQA metrics directly:

Relationship Richness tells us whether our concepts form a web of explanatory connections or just a flat list.
Attribute Richness maps to our schema sections — entities with more optional sections filled (Original Wording, Modern Interpretation) are richer.

Adaptation: The key modification is that relationship inference is an LLM-Eval step (pairwise: "does A's definition depend on or reference B?"), after which all OntoQA metrics are computed deterministically on the resulting graph.
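Once the edges exist, the two OntoQA metrics used here reduce to simple counting. The sketch below assumes each inferred edge carries a `kind` label and that `"subsumes"` marks the hierarchical ones; that labelling scheme, the function names, and the sample entities are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def relationship_richness(edges: List[Tuple[str, str, str]]) -> float:
    """RR = non-hierarchical edges / total edges (0.0 for an empty graph)."""
    if not edges:
        return 0.0
    lateral = sum(1 for _, _, kind in edges if kind != "subsumes")
    return lateral / len(edges)


def attribute_richness(sections: Dict[str, List[str]]) -> float:
    """AR = total filled schema sections / total entities."""
    if not sections:
        return 0.0
    return sum(len(filled) for filled in sections.values()) / len(sections)


# Hypothetical inferred edges: (source, target, kind).
edges = [
    ("natural-rate", "ordinary-or-average-rate", "subsumes"),
    ("market-price", "effectual-demand", "depends_on"),
    ("market-price", "market", "depends_on"),
    ("division-of-labour", "market", "co_occurs"),
]
# Hypothetical filled sections per entity.
sections = {
    "market-price": ["Definition", "Original Wording", "Modern Interpretation"],
    "market": ["Definition"],
    "division-of-labour": ["Definition", "Original Wording"],
}

rr = relationship_richness(edges)
ar = attribute_richness(sections)
print(f"RR={rr:.2f} AR={ar:.2f}")
```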

3.5 Formal Concept Analysis (FCA)

Origin: Wille (1982). Applied to ontology auditing by Elhaj et al. (2008) for SNOMED CT completeness checking.

What it defines: A mathematical framework for deriving a concept lattice from a binary relation between objects and attributes. The lattice reveals:

Formal concepts: maximal sets of objects sharing the same attributes
Subconcept/superconcept relationships: the natural hierarchy
Missing concepts: attribute combinations with no corresponding object

How we use it: We construct a formal context (binary matrix):

Objects = our 85 entities
Attributes = economic domain, VSM system, source book, abstraction level (from LLM-Eval), key terms (extracted from definitions)

The concept lattice then reveals:

Coverage gaps (C2): Attribute combinations with no entity. E.g. if the cell {Distribution, S3} is empty, we lack control-layer concepts for distribution — a specific, actionable gap.
Redundancy (C1): Entities with identical attribute sets (same formal concept) are candidates for merging.
Granularity (C5): The lattice depth indicates how many meaningful levels of abstraction exist. A shallow lattice suggests missing intermediate concepts.

Adaptation: Classic FCA requires crisp binary attributes. Our domains and VSM mappings are already categorical, but abstraction level and key terms must be produced by LLM-Eval. The lattice computation itself is deterministic (the Python `concepts` library or equivalent). The FCA approach replaces the current "ask the LLM about coverage" with a structural computation that identifies specific gaps rather than vague recommendations.
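To make the gap and redundancy readings concrete, here is a simplified stand-in that works directly on the formal context rather than on a full concept lattice. It only checks pairwise combinations across two attribute families and groups entities by identical attribute sets; a real lattice computation covers more. The entity-to-attribute assignments, the `domain:`/`vsm:` prefixes, and both function names are hypothetical.

```python
from collections import defaultdict
from itertools import product
from typing import Dict, List, Set, Tuple

# Hypothetical formal context: entity -> set of attributes.
context: Dict[str, Set[str]] = {
    "market-price": {"domain:Exchange", "vsm:S2"},
    "effectual-demand": {"domain:Exchange", "vsm:S2"},
    "division-of-labour": {"domain:Production", "vsm:S1"},
    "wages": {"domain:Distribution", "vsm:S1"},
}


def empty_combinations(
    context: Dict[str, Set[str]], family_a: List[str], family_b: List[str]
) -> List[Tuple[str, str]]:
    """Attribute pairs (a, b) that no entity carries together."""
    populated = {
        (a, b)
        for attrs in context.values()
        for a in family_a if a in attrs
        for b in family_b if b in attrs
    }
    return [pair for pair in product(family_a, family_b) if pair not in populated]


def identical_profiles(context: Dict[str, Set[str]]) -> List[List[str]]:
    """Groups of entities sharing exactly the same attribute set."""
    groups = defaultdict(list)
    for entity, attrs in context.items():
        groups[frozenset(attrs)].append(entity)
    return [sorted(members) for members in groups.values() if len(members) > 1]


domains = ["domain:Distribution", "domain:Exchange", "domain:Production"]
systems = ["vsm:S1", "vsm:S2", "vsm:S3"]

gaps = empty_combinations(context, domains, systems)
duplicates = identical_profiles(context)
print(gaps)
print(duplicates)
```

In this toy context the pair (`domain:Distribution`, `vsm:S3`) shows up as a gap, mirroring the {Distribution, S3} example above, and the two Exchange entities surface as merge candidates.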

3.6 DSL Design Principles

Origin: Mernik et al. (2005) "When and How to Develop DSLs"; Karsai et al. (2014) "Design Guidelines for Domain-Specific Languages".

What they define: Quality criteria for a set of concepts that form a language for a specific domain:

Soundness: Every concept in the language corresponds to a real domain concern (no invented abstractions).
Completeness: The language can express everything needed for its intended tasks.
Laconicity: No unnecessary concepts — every concept earns its place.
Orthogonality: Concepts are independent; combining any two produces a meaningful result (no redundant combinations).

How we use it: Our entity set is effectively a domain-specific vocabulary for "explaining classical economics through VSM". DSL quality criteria translate directly:

Soundness → Validity (SEQUAL): every entity grounded in Smith's text
Completeness → Coverage (C2): can we answer the "competency questions" the infospace is meant to address?
Laconicity → Anti-redundancy (C1) + Indispensability (C5): would removing any entity lose explanatory power?
Orthogonality → Non-overlap (C1): entity definitions don't substantially duplicate each other

Adaptation: We operationalise DSL completeness through competency questions — a set of canonical questions the infospace should be able to answer (e.g. "How does the division of labour relate to market extent?", "What mechanisms regulate wages toward their natural rate?"). LLM-Eval tests whether the current entity set suffices to answer each question. Unanswerable questions identify specific completeness gaps.

Laconicity is operationalised as indispensability scoring: for each entity, LLM-Eval rates whether removing it would lose explanatory power. Low-scoring entities are candidates for merging or retirement.

4. Integration: Metric Definitions by Concern

C1: Semantic Overlap / Redundancy

Goal: Identify entities that substantially overlap in meaning and should be merged, distinguished, or retired.

Metrics:

| Metric | Type | Computation |
|---|---|---|
| `similarity_matrix` | Deterministic | Embed all entity definitions; compute NxN cosine similarity |
| `high_similarity_pairs` | Deterministic | Pairs with cosine > 0.80, sorted descending |
| `confirmed_synonyms` | LLM-Eval | For each high-similarity pair, LLM judges: "same concept" / "genuinely distinct" / "partial overlap" |
| `redundancy_ratio` | Deterministic | `confirmed_synonyms / total_entities` |
| `intensional_conciseness` | Deterministic | `1 - redundancy_ratio` (from KG quality framework) |

Pipeline:

Embed definitions (embedding API or local model)
Compute cosine similarity matrix
Filter pairs above threshold
LLM pairwise judgment on filtered pairs only (avoids N² LLM calls)
Aggregate into ratio and conciseness score

Output: output/metrics/redundancy-report.md + structured YAML with pair list, scores, and merge/retire recommendations.
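The deterministic steps of this pipeline can be sketched with the standard library alone. The toy three-dimensional vectors stand in for real definition embeddings, and the entity names and function names are hypothetical; the 0.80 threshold is the one stated in the metrics table. The LLM judgment step is represented only by a hard-coded count.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def high_similarity_pairs(
    embeddings: Dict[str, Sequence[float]], threshold: float = 0.80
) -> List[Tuple[str, str, float]]:
    """Entity pairs whose cosine similarity exceeds the threshold, highest first."""
    pairs = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(sorted(embeddings.items()), 2):
        score = cosine(vec_a, vec_b)
        if score > threshold:
            pairs.append((name_a, name_b, score))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Toy vectors standing in for definition embeddings.
embeddings = {
    "natural-price": [1.0, 0.0, 0.2],
    "ordinary-price": [0.9, 0.1, 0.2],
    "division-of-labour": [0.0, 1.0, 0.0],
}

candidates = high_similarity_pairs(embeddings)

# Assume the LLM confirmed one of the candidate pairs as synonymous.
confirmed_synonyms = 1
redundancy_ratio = confirmed_synonyms / len(embeddings)
intensional_conciseness = 1 - redundancy_ratio
print(candidates, redundancy_ratio, intensional_conciseness)
```

Only the pairs in `candidates` would be sent to the LLM, which is what keeps the number of pairwise judgments far below N².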

C2: Coverage Completeness

Goal: Identify domain areas that are structurally sparse or isolated within the corpus — and separately, assess whether the entity set can answer the infospace's declared competency questions.

What the deterministic check actually computes

The current implementation builds a binary domain × chapter cross-table: one row per economic domain, one column per source chapter. A cell is populated if at least one entity has that (domain, chapter) combination.

coverage_ratio = populated_cells / (n_domains × n_chapters)

This is not the same as VSM coverage. The domain × VSM matrix described in earlier versions of this document requires VSM system mappings to be supplied as extra_attributes to check_coverage(). The pipeline does not currently do this, so coverage_ratio reflects cross-chapter domain distribution, not VSM system coverage.

Important: interpret the distribution, not just the ratio

The aggregate ratio conflates two structurally different situations:

| Situation | coverage_ratio | What it means |
|---|---|---|
| Healthy topic separation | Low | Domains are locally dense within their book/section; expected for a multi-topic corpus |
| Fragmented extraction | Low | Domains appear sporadically everywhere, never anchored |

Both produce the same ratio. Use the per-domain density distribution to distinguish them:

| Metric | Meaning |
|---|---|
| `domain_densities` | Per-domain fraction of chapters containing ≥1 entity with that domain |
| `density_std` | Standard deviation of densities. High std → healthy topic separation (bimodal: some domains cross-cutting, others local). Low std → uniform but thin. |
| `cross_cutting_ratio` | Fraction of domains appearing in >50 % of chapters, i.e. the foundational, cross-cutting concepts. |

Example interpretation for the WoN/VSM infospace (1021 entities, 35 chapters):

Exchange        0.848  ████████████████   cross-cutting
Regulation      0.848  ████████████████   cross-cutting
General Theory  0.727  ██████████████     cross-cutting
Production      0.636  ████████████       cross-cutting
Distribution    0.576  ███████████        borderline
Accumulation    0.364  ███████            book-specific
Consumption     0.333  ██████             book-specific

density_std = 0.33   (high → healthy topic separation)
cross_cutting_ratio = 0.50
coverage_ratio = 0.44  (below 0.50 threshold, but for correct reasons)
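The four distribution metrics follow directly from the binary domain × chapter matrix. The sketch below is not the code in `coverage.py`; it is a minimal re-derivation from the definitions above, run on a made-up 3 × 4 corpus. The function name `coverage_metrics` is hypothetical, and the use of the population standard deviation is an assumption (the implementation may use the sample form).

```python
from statistics import pstdev
from typing import Dict, Iterable, List, Tuple


def coverage_metrics(
    pairs: Iterable[Tuple[str, str]],
    domains: List[str],
    chapters: List[str],
) -> Dict[str, object]:
    """Compute C2 distribution metrics from (domain, chapter) pairs of entities."""
    # A cell is populated if at least one entity carries that combination.
    cells = {(d, c) for d, c in pairs if d in domains and c in chapters}
    densities = {
        d: sum((d, c) in cells for c in chapters) / len(chapters) for d in domains
    }
    values = list(densities.values())
    return {
        "coverage_ratio": len(cells) / (len(domains) * len(chapters)),
        "domain_densities": densities,
        # Population std dev is an assumption; a sample std dev is equally plausible.
        "density_std": pstdev(values),
        "cross_cutting_ratio": sum(v > 0.5 for v in values) / len(domains),
        "empty_cells": [(d, c) for d in domains for c in chapters if (d, c) not in cells],
    }


# Made-up corpus: one cross-cutting, one mostly present, one local domain.
domains = ["Exchange", "Production", "Consumption"]
chapters = ["ch1", "ch2", "ch3", "ch4"]
pairs = (
    [("Exchange", c) for c in chapters]
    + [("Production", c) for c in chapters[:3]]
    + [("Consumption", "ch4")]
)

report = coverage_metrics(pairs, domains, chapters)
print(report)
```

Here `coverage_ratio` is 8/12, yet the per-domain densities (1.0, 0.75, 0.25) show that the empty cells come from one local domain rather than from uniformly thin extraction, which is the distinction the aggregate ratio cannot make.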

What coverage does NOT capture

Entity-to-entity connections — whether concepts reference each other, form explanatory chains, or cluster coherently. That is C3 (Structural Coherence).
VSM competency question answerability — whether current entities collectively support answering the declared competency questions. That requires LLM-Eval and is a planned metric (see below).
Whether absent (domain, chapter) cells are meaningful gaps or expected absences — the ratio treats them identically.

Threshold guidance

min: 0.50 is appropriate for a focused, single-topic corpus where all chapters address the same set of domains.
For heterogeneous multi-book corpora, domains introduced late create empty cells for all earlier chapters. A threshold of 0.30–0.40 is more realistic.
Prefer cross_cutting_ratio and density_std as the primary diagnostic signals; use coverage_ratio only for trend tracking across snapshots.

Metrics:

| Metric | Type | Computation | Status |
|---|---|---|---|
| `coverage_ratio` | Deterministic | `populated_cells / (n_domains × n_chapters)` | ✅ Implemented |
| `domain_densities` | Deterministic | Per-domain fraction of chapters with ≥1 entity | ✅ Implemented |
| `density_std` | Deterministic | Std dev of domain densities | ✅ Implemented |
| `cross_cutting_ratio` | Deterministic | Fraction of domains with density > 0.5 | ✅ Implemented |
| `empty_cells` | Deterministic | List of unpopulated (domain, chapter) pairs | ✅ Implemented |
| `fca_gap_concepts` | Deterministic | Attribute combos in FCA lattice with no entity | ✅ Implemented |
| `domain_vsm_matrix` | Deterministic | Entities per {domain, VSM_system} cell; requires VSM mappings in `extra_attributes` | ⬜ Not yet wired |
| `competency_coverage` | LLM-Eval | For each competency question, can it be answered? | ⬜ Not yet implemented |

Pipeline (current):

Parse entity metadata (domain, source chapter) from entity files
Build domain × chapter binary matrix; identify empty cells
Compute per-domain densities, std dev, cross-cutting ratio
Build FCA formal context; extract gap concepts
Aggregate into CoverageReport

Output: Snapshot recorded in output/metrics/history.yaml. A coverage-report.md per chapter is planned but not yet generated.

C3: Structural Coherence

Goal: Determine whether the entities form a connected explanatory web or a fragmented collection of isolated concepts.

Metrics:

| Metric | Type | Computation |
|---|---|---|
| `relationship_graph` | LLM-Eval + Deterministic | Infer edges from definition cross-references (string matching) + LLM judgment for implicit references |
| `connected_components` | Deterministic | Number of (weakly) connected components in the graph (target: 1) |
| `graph_density` | Deterministic | `actual_edges / possible_edges` |
| `avg_degree` | Deterministic | `total_edges / total_entities` (the mean out-degree of the directed graph) |
| `relationship_richness` | Deterministic | OntoQA RR: `non_hierarchical_edges / total_edges` |
| `modularity` | Deterministic | Louvain modularity score (0.3-0.7 = meaningful structure; >0.8 = fragmentation) |
| `bridge_concepts` | Deterministic | Entities with highest betweenness centrality (connect clusters) |
| `orphan_entities` | Deterministic | Entities with degree 0 or 1 |
| `cohesion_by_domain` | Deterministic | Avg intra-domain edges per entity |
| `coupling_across_domains` | Deterministic | Inter-domain edges / total edges |

Pipeline:

Extract explicit cross-references from definitions (entity name mentions in other definitions — string matching with slug normalisation)
For entity pairs not caught by string matching, LLM-Eval: "Does A's definition depend on or reference B's concept?"
Build directed graph
Compute graph metrics (networkx or equivalent)
Run community detection; compare detected communities to declared economic domains

Output: output/metrics/coherence-report.md + YAML with graph statistics, orphan list, bridge concepts, and community structure.
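A few of the graph statistics above can be computed without any graph library, which is useful as a cross-check. The sketch assumes a directed edge list whose endpoints all appear in `nodes`, treats connectivity as weak (direction ignored), and counts possible edges as ordered pairs without self-loops. The function name and the sample entities are hypothetical; modularity and betweenness are left to networkx or an equivalent.

```python
from typing import Dict, List, Set, Tuple


def graph_stats(nodes: List[str], edges: List[Tuple[str, str]]) -> Dict[str, object]:
    """Basic C3 statistics for a directed entity graph."""
    # Undirected neighbour view: used for weak connectivity and degree.
    neighbours: Dict[str, Set[str]] = {n: set() for n in nodes}
    for src, dst in edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)

    # Weakly connected components via iterative depth-first search.
    seen: Set[str] = set()
    components: List[Set[str]] = []
    for start in nodes:
        if start in seen:
            continue
        component: Set[str] = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)

    n = len(nodes)
    possible_edges = n * (n - 1)  # ordered pairs, no self-loops
    return {
        "connected_components": len(components),
        "graph_density": len(edges) / possible_edges if possible_edges else 0.0,
        "avg_degree": len(edges) / n if n else 0.0,
        "orphan_entities": sorted(x for x in nodes if len(neighbours[x]) <= 1),
    }


# Hypothetical entities and definitional references.
nodes = ["market-price", "effectual-demand", "natural-price",
         "division-of-labour", "market-extent", "rent"]
edges = [
    ("market-price", "effectual-demand"),
    ("market-price", "natural-price"),
    ("natural-price", "effectual-demand"),
    ("division-of-labour", "market-extent"),
]

stats = graph_stats(nodes, edges)
print(stats)
```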

C4: Definitional Consistency

Goal: Ensure entities are defined consistently, non-circularly, and without contradicting each other.

Metrics:

| Metric | Type | Computation |
|---|---|---|
| `definitional_dependency_graph` | Deterministic + LLM-Eval | Edges where A's definition uses B's concept |
| `circular_definitions` | Deterministic | Cycles of length ≤ 3 in the dependency graph |
| `definition_depth` | Deterministic | Longest dependency chain per entity before reaching a term not in the entity set |
| `undefined_dependencies` | Deterministic | Terms used in definitions that arguably should be entities but aren't |
| `pairwise_consistency` | LLM-Eval | For related entity pairs (sharing edges): "Do these definitions contradict each other?" |
| `source_fidelity` | LLM-Eval | "Does this definition accurately represent what Smith wrote in the cited passage?" |
| `metaproperty_violations` | LLM-Eval + Deterministic | OntoClean constraint checking after LLM classifies rigidity/identity |
| `grounding_ratio` | Deterministic | Fraction of entities traceable to primitives without cycles |

Pipeline:

Build definitional dependency graph (same inference technique as C3; here the edge direction carries the meaning: an edge A → B means A's definition uses B, not vice versa)
Detect cycles; flag short cycles
Extract undefined terms (terms matching entity-name patterns that appear in definitions but have no corresponding entity file)
LLM pairwise consistency check on directly-connected pairs
LLM source fidelity check (compare definition to source chapter text)
LLM OntoClean metaproperty classification; deterministic constraint checking

Output: output/metrics/consistency-report.md + YAML with cycle list, undefined terms, contradiction candidates, and metaproperty violations.
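Short-cycle detection and the grounding ratio can both be sketched on a plain adjacency mapping. The sketch assumes `deps` maps each entity to the set of terms its definition uses, and it reads "traceable to primitives without cycles" as "no dependency cycle is reachable from the entity"; that reading, the function names, and the sample data (including the contrived two-entity cycle) are assumptions.

```python
from typing import Dict, List, Optional, Set, Tuple


def short_cycles(deps: Dict[str, Set[str]], max_len: int = 3) -> List[Tuple[str, ...]]:
    """Dependency cycles of at most max_len entities, each reported once."""
    found: Set[Tuple[str, ...]] = set()

    def walk(start: str, node: str, path: List[str]) -> None:
        for nxt in deps.get(node, ()):
            if nxt == start:
                # Rotate so the smallest name comes first, to deduplicate rotations.
                i = path.index(min(path))
                found.add(tuple(path[i:] + path[:i]))
            elif nxt not in path and len(path) < max_len:
                walk(start, nxt, path + [nxt])

    for entity in deps:
        walk(entity, entity, [entity])
    return sorted(found)


def grounding_ratio(deps: Dict[str, Set[str]]) -> float:
    """Fraction of entities from which no dependency cycle is reachable."""
    state: Dict[str, Optional[bool]] = {}  # None marks a node still being explored

    def grounded(node: str) -> bool:
        if node in state:
            return state[node] is True  # hitting an in-progress node means a cycle
        state[node] = None
        ok = all(grounded(n) for n in deps.get(node, ()))
        state[node] = ok
        return ok

    entities = list(deps)
    return sum(grounded(e) for e in entities) / len(entities) if entities else 1.0


# Hypothetical dependency graph with one contrived 2-cycle.
deps = {
    "market-price": {"effectual-demand", "natural-price"},
    "natural-price": {"ordinary-rate"},
    "ordinary-rate": {"natural-price"},
    "effectual-demand": {"demand"},  # "demand" is not an entity: treated as primitive
    "wages": set(),
}

cycles = short_cycles(deps)
ratio = grounding_ratio(deps)
print(cycles, ratio)
```

The recursive walk is fine at this scale; for large entity sets an iterative traversal would avoid Python's recursion limit.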

C5: Granularity Balance

Goal: Ensure entities operate at comparable levels of abstraction within their respective domains and perspectives.

Metrics:

| Metric | Type | Computation |
|---|---|---|
| `abstraction_classification` | LLM-Eval | Classify each entity as theory-level / mechanism-level / observation-level |
| `scope_score` | LLM-Eval | Rate each entity 1-5 for generality (1 = very specific instance, 5 = broad theoretical principle) |
| `abstraction_distribution` | Deterministic | Count per level; compute entropy |
| `scope_variance` | Deterministic | Variance of scope scores within each domain |
| `level_x_perspective_matrix` | Deterministic | Cross-tabulation of abstraction level × economic domain |
| `indispensability` | LLM-Eval | "If removed, what explanatory power is lost?" (1-5) |
| `dispensable_entities` | Deterministic | Entities with indispensability score ≤ 2 |
| `merge_candidates` | LLM-Eval | Pairs where one is a sub-case of the other |

Pipeline:

LLM-classify each entity: abstraction level, scope score, indispensability
Build level × perspective matrix
Compute distribution entropy and per-domain scope variance
Flag outliers: entities whose scope score deviates > 1.5σ from their domain mean
For outlier entities, LLM-Eval: "Should this be merged into a broader concept, or split into sub-concepts?"

Output: output/metrics/granularity-report.md + YAML with classifications, distribution, outliers, and merge/split recommendations.
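The deterministic steps here are the entropy of the level distribution and the 1.5σ outlier rule. The sketch below assumes Shannon entropy in bits and a population standard deviation per domain; both choices, the function names, and the sample scores are assumptions. Domains with zero spread are skipped, since no score can deviate from a constant mean.

```python
import math
from collections import Counter, defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def distribution_entropy(levels: List[str]) -> float:
    """Shannon entropy (bits) of the abstraction-level distribution."""
    counts = Counter(levels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def scope_outliers(scores: Dict[str, Tuple[str, int]], k: float = 1.5) -> List[str]:
    """Entities whose scope score deviates more than k sigma from their domain mean."""
    by_domain: Dict[str, List[int]] = defaultdict(list)
    for domain, score in scores.values():
        by_domain[domain].append(score)

    outliers = []
    for entity, (domain, score) in scores.items():
        values = by_domain[domain]
        sigma = pstdev(values)
        if sigma > 0 and abs(score - mean(values)) > k * sigma:
            outliers.append(entity)
    return sorted(outliers)


# Hypothetical LLM-Eval output.
levels = ["theory", "mechanism", "mechanism", "observation"]
scores = {
    "market-price": ("Exchange", 2),
    "effectual-demand": ("Exchange", 2),
    "higgling": ("Exchange", 2),
    "barter": ("Exchange", 2),
    "theory-of-value": ("Exchange", 5),
    "division-of-labour": ("Production", 3),
    "pin-factory": ("Production", 3),
}

entropy = distribution_entropy(levels)
outliers = scope_outliers(scores)
print(entropy, outliers)
```

Note that with a population standard deviation, a domain needs at least five entities before any single score can exceed 1.5σ, so small domains never produce outliers under this rule.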

5. Shared Infrastructure

Several concerns share underlying computations:

| Infrastructure | Used by | Built once by |
|---|---|---|
| Definition embeddings (vector per entity) | C1, C3 | Embedding API call per entity |
| Relationship graph (entity → entity edges) | C3, C4 | String matching + LLM-Eval |
| FCA formal context (entity × attribute matrix) | C2, C5 | Metadata parsing + LLM classification |
| Entity metadata index (domain, VSM, chapter, sections) | C2, C5, C10 (schema compliance) | Deterministic markdown parsing |

These should be computed once per evaluation run and cached for use by all concern-specific metrics.

6. Evaluation Workflow

A full collection-level evaluation run:

process_chapters.py --evaluate-collection --provider <provider>

Parse — deterministic metadata extraction from all entity files
Embed — compute definition embeddings (cached; only new/changed entities need fresh embeddings)
Infer — LLM-Eval for relationship edges, metaproperties, abstraction levels, pairwise judgments (batched to minimise LLM calls)
Compute — deterministic graph metrics, FCA lattice, coverage matrix, similarity matrix, cycle detection
Aggregate — combine per-entity and per-pair scores into collection-level metrics
Report — write per-concern markdown reports + unified metrics.yaml
Append — add timestamped snapshot to metrics-history.yaml

Incremental mode (--evaluate-collection --chapter <id>) re-evaluates only the entities introduced or modified by that chapter, plus any pairwise checks involving those entities.
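The caching behaviour described for the Embed step ("only new/changed entities need fresh embeddings") can be sketched with a content hash per definition. This is one possible approach, not a description of how the pipeline caches: the function names, the in-memory cache layout, and the stand-in `fake_embed` function are all hypothetical.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def definition_digest(text: str) -> str:
    """Stable fingerprint of a definition's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def refresh_embeddings(
    definitions: Dict[str, str],
    cache: Dict[str, Tuple[str, Vector]],
    embed: Callable[[str], Vector],
) -> List[str]:
    """Re-embed only new or changed definitions; return the entities refreshed."""
    refreshed = []
    for entity, text in definitions.items():
        digest = definition_digest(text)
        cached = cache.get(entity)
        if cached is None or cached[0] != digest:
            cache[entity] = (digest, embed(text))
            refreshed.append(entity)
    return refreshed


def fake_embed(text: str) -> Vector:
    """Stand-in for a real embedding call; returns a one-dimensional vector."""
    return [float(len(text))]


cache: Dict[str, Tuple[str, Vector]] = {}
definitions = {
    "market-price": "The actual price at which a commodity is sold.",
    "wages": "The recompense of labour.",
}

first_run = refresh_embeddings(definitions, cache, fake_embed)

# Edit one definition; only that entity should be re-embedded.
definitions["wages"] = "The price paid for labour."
second_run = refresh_embeddings(definitions, cache, fake_embed)
print(first_run, second_run)
```

Keying on a hash of the definition text rather than on file modification time means a re-save without changes does not trigger a new embedding call.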

7. References

Lindland, O.I., Sindre, G. & Sølvberg, A. (1994). "Understanding Quality in Conceptual Modeling." IEEE Software 11(2), 42-49. → SEQUAL framework: validity and completeness dimensions.
Guarino, N. & Welty, C.A. (2004). "An Overview of OntoClean." In Handbook on Ontologies, Springer, 151-171. → Metaproperty analysis: rigidity, identity, unity, dependence.
Poveda-Villalón, M., Gómez-Pérez, A. & Suárez-Figueroa, M.C. (2014). "OOPS! (OntOlogy Pitfall Scanner!): An On-line Tool for Ontology Evaluation." IJSWIS 10(2), 7-34. → Pitfall catalogue: 41 anti-patterns for ontology design.
Tartir, S. & Arpinar, I.B. (2007). "Ontology Evaluation and Ranking using OntoQA." ICSC 2007, IEEE, 185-192. → Schema metrics: relationship richness, attribute richness.
Wille, R. (1982). "Restructuring Lattice Theory." In Ordered Sets, Reidel, 445-470. → Formal Concept Analysis: concept lattices from binary contexts.
Elhaj, H. et al. (2008). "Auditing SNOMED CT with Formal Concept Analysis." AMIA Annual Symposium, PMC2605587. → FCA for ontology completeness auditing.
Keet, C.M. (2008). A Formal Theory of Granularity. PhD thesis, Free University of Bozen-Bolzano. → Granularity levels and perspectives for ontology design.
Mernik, M., Heering, J. & Sloane, A.M. (2005). "When and How to Develop Domain-Specific Languages." ACM Computing Surveys 37(4), 316-344. → DSL design: soundness, completeness, laconicity.
Karsai, G. et al. (2014). "Design Guidelines for Domain Specific Languages." arXiv:1409.2378. → Orthogonality, necessary-and-sufficient principle.
Xue, B. & Zou, L. (2022). "Knowledge Graph Quality Management: A Comprehensive Survey." IEEE TKDE 35(5), 4969-4988. → KG quality dimensions: conciseness, consistency, completeness.
