markitect-main/examples/infospace-with-history/METRICS-METHODOLOGY.md
tegwick 4ce856d4d0 docs: metrics methodology, collection-level tasks, and infospace tooling roadmap
Add METRICS-METHODOLOGY.md documenting the theoretical frameworks
(SEQUAL, OntoClean, OOPS!, OntoQA, FCA, DSL principles) adapted for
two-layer evaluation (LLM-Eval + deterministic aggregation) across
five collection concerns: redundancy, coverage, coherence, consistency,
and granularity balance.

Extend INFRA-TASKS.md with assignment assessment (tasks 4-7),
per-concept metrics (tasks 8-12), and collection-level metrics
(tasks 13-19).

Add roadmap/infospace-tooling/PLAN.md defining terminology (infospace,
topic, discipline, entity, evaluation, viability) and a three-stage
implementation plan: Stage 1 platform additions, Stage 2 infospace
tooling layer, Stage 3 example revision.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-18 23:53:21 +01:00


Collection-Level Metrics Methodology

How we evaluate the quality of the infospace as a collection of interrelated concepts, beyond the quality of individual entities.

This document describes the theoretical frameworks, drawn from ontology engineering, formal concept analysis, semiotic quality theory, and DSL design, and how each is adapted to work within MarkiTect's two-layer evaluation model (LLM-Eval + deterministic aggregation).


1. The Two-Layer Model

Every metric in this methodology decomposes into two layers:

| Layer | What it does | How it runs |
|---|---|---|
| LLM-Eval | Qualitative judgment: "Are these two concepts the same?", "Is this definition grounded in the source?" | Prompt template → LLM → structured YAML output |
| Deterministic | Quantitative aggregation: cosine similarity, graph connectivity, coverage counting, cycle detection | Python code in process_chapters.py or dedicated metrics.py |

The LLM-Eval layer produces per-entity or per-pair structured scores. The deterministic layer aggregates these into collection-level metrics, persisted as machine-readable YAML alongside human-readable markdown reports.
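The hand-off between the two layers can be sketched as follows. This is a minimal illustration, not the code in process_chapters.py: the record fields (`entity`, `metric`, `score`) and the `aggregate` helper are assumptions, and the records stand in for LLM-Eval YAML output that has already been parsed into dicts.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-entity LLM-Eval records, as they might look after the
# structured YAML output has been parsed into Python dicts.
llm_eval_records = [
    {"entity": "division-of-labour", "metric": "source_grounding", "score": 5},
    {"entity": "market-price", "metric": "source_grounding", "score": 4},
    {"entity": "natural-rate", "metric": "source_grounding", "score": 3},
    {"entity": "market-price", "metric": "definition_precision", "score": 4},
]


def aggregate(records):
    """Deterministic layer: average the LLM-Eval scores per metric."""
    by_metric = defaultdict(list)
    for record in records:
        by_metric[record["metric"]].append(record["score"])
    return {metric: mean(scores) for metric, scores in by_metric.items()}


collection_metrics = aggregate(llm_eval_records)
print(collection_metrics)
```

Keeping the aggregation free of LLM calls means the same per-entity records always yield the same collection-level numbers.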

Per-concept quality metrics (definition precision, source grounding, VSM relevance — see INFRA-TASKS 8-12) operate at the individual entity level. This document covers the five collection-level concerns that assess how the entities work together as an explanatory system.


2. Five Collection-Level Concerns

Overview

| # | Concern | Question | Primary framework |
|---|---|---|---|
| C1 | Semantic Overlap | Are there redundant concepts? | OOPS! P2, embedding similarity |
| C2 | Coverage Completeness | Does the concept set cover the domain? | SEQUAL, FCA |
| C3 | Structural Coherence | Do concepts form a connected explanatory graph? | OntoQA, graph theory |
| C4 | Definitional Consistency | Are concepts defined consistently and non-circularly? | OntoClean, OOPS! P24 |
| C5 | Granularity Balance | Are concepts at comparable levels of abstraction? | Granularity theory, DSL laconicity |

3. Theoretical Frameworks

3.1 SEQUAL (Semiotic Quality Framework)

Origin: Lindland, Sindre & Sølvberg (1994), extended by Krogstie et al.

What it defines: Quality of a conceptual model as the correspondence between three worlds — the domain (what exists), the model (what we captured), and the audience's interpretation (what they understand).

Two key dimensions of semantic quality:

  • Validity — everything in the model corresponds to something real in the domain. No invented concepts.
  • Completeness — everything relevant in the domain is represented in the model. No missing concepts.

How we use it: SEQUAL frames our entire metrics approach. Every collection-level metric maps to one of these dimensions:

| SEQUAL dimension | Our concerns |
|---|---|
| Validity | C1 (redundancy reduces validity: duplicate concepts don't correspond to distinct domain facts), C4 (consistency: contradictory definitions can't both be valid) |
| Completeness | C2 (coverage: are all needed concepts present?), C5 (granularity: missing levels of abstraction are completeness gaps) |
| Both | C3 (coherence: disconnected concepts suggest either missing bridging concepts [completeness] or misplaced concepts [validity]) |

Adaptation: SEQUAL was designed for formal models evaluated by human experts. We replace human judgment with LLM-Eval (for validity checks like "does this concept correspond to something Smith actually described?") and deterministic counting (for completeness checks like "which VSM systems lack entity mappings?").

3.2 OntoClean

Origin: Guarino & Welty (2004).

What it defines: A methodology for validating taxonomic relationships by assigning metaproperties to each concept:

  • Rigidity — Is the property essential to all its instances? (e.g. "market" is rigid; "effectual demander" is anti-rigid — an agent can stop being an effectual demander)
  • Identity — Does the concept carry an identity criterion? (e.g. "division of labour" can be identified by its three causal mechanisms)
  • Unity — Are all instances of this concept whole in the same way?
  • Dependence — Does the concept require another concept to exist? (e.g. "market price" depends on "effectual demand")

Constraint: A rigid concept cannot be subsumed by an anti-rigid one. Violations indicate structural confusion.

How we use it: We do not have a formal taxonomy, but our flat entity set implicitly contains subsumption relationships (e.g. "natural rate" subsumes "ordinary-or-average rate"). OntoClean metaproperties help detect:

  • Granularity mismatches (C5): A rigid concept at the same level as an anti-rigid one suggests different abstraction levels are mixed.
  • Definitional consistency (C4): If entity A depends on entity B per OntoClean, but B's definition doesn't acknowledge A, the definitions are inconsistent.
  • Redundancy (C1): Two entities with identical metaproperty profiles and overlapping definitions are candidates for merging.

Adaptation: Instead of manual metaproperty assignment, we use LLM-Eval to classify each entity's rigidity, identity criterion, and dependencies. The constraint checking is then deterministic.
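A minimal sketch of the deterministic constraint check, assuming the LLM-Eval step has already produced a rigidity label per entity and a list of implied subsumption pairs. The dictionary layout, the `rigidity_violations` helper, and the second (deliberately wrong) subsumption pair are illustrative assumptions, not part of the existing tooling.

```python
# Hypothetical LLM-Eval output: rigidity label per entity slug.
rigidity = {
    "market": "rigid",
    "effectual-demander": "anti-rigid",
    "natural-rate": "rigid",
    "ordinary-or-average-rate": "rigid",
}

# Hypothetical subsumption pairs as (parent, child): parent subsumes child.
subsumptions = [
    ("natural-rate", "ordinary-or-average-rate"),
    ("effectual-demander", "market"),  # deliberately wrong, to trigger the check
]


def rigidity_violations(rigidity, subsumptions):
    """Return (parent, child) pairs where an anti-rigid parent subsumes a rigid child."""
    return [
        (parent, child)
        for parent, child in subsumptions
        if rigidity.get(parent) == "anti-rigid" and rigidity.get(child) == "rigid"
    ]


violations = rigidity_violations(rigidity, subsumptions)
print(violations)
```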

3.3 OOPS! (Ontology Pitfall Scanner)

Origin: Poveda-Villalón et al. (2014). Catalogue of 41 common ontology design pitfalls.

What it defines: Concrete, testable anti-patterns. The pitfalls most relevant to our infospace:

| Pitfall | Description | Our concern |
|---|---|---|
| P2 | Synonymous classes: different names, same meaning | C1 (redundancy) |
| P4 | Unconnected ontology elements | C3 (coherence) |
| P13 | Missing inverse relationships | C3 |
| P7 | Merging different concepts in the same class | C5 (granularity: too coarse) |
| P11 | Missing domain or range | C4 (consistency) |
| P10 | Missing disjointness axioms | C1 (how do we know two concepts don't overlap?) |
| P24 | Recursive/circular definition | C4 (consistency) |
| P25 | Inverse of itself | C4 |

How we use it: OOPS! pitfalls become a checklist for LLM-Eval prompts. Rather than running a formal OWL scanner, we ask the LLM to check for each pitfall pattern:

  • "Are entities A and B synonymous?" (P2)
  • "Does entity A's definition reference itself?" (P24)
  • "Is entity A actually two distinct concepts merged together?" (P7)

The deterministic layer counts pitfall occurrences and tracks them over time.

Adaptation: We select the subset of OOPS! pitfalls applicable to semi-formal markdown-based ontologies (no OWL axioms) and implement each as an LLM-Eval prompt pattern rather than a formal reasoner check.

3.4 OntoQA (Metric-Based Ontology Quality Analysis)

Origin: Tartir & Arpinar (2007).

What it defines: Quantitative schema-level and instance-level metrics:

  • Relationship Richness (RR): Proportion of non-taxonomic (lateral) relationships to total relationships. RR = non_hierarchical / total. Low RR = mere taxonomy. High RR = rich cross-cutting connections.
  • Attribute Richness (AR): Average number of attributes per concept. AR = total_attributes / total_concepts.
  • Inheritance Richness (IR): Average subclasses per class — measures how knowledge distributes across the hierarchy.
  • Class Richness (CR): Proportion of classes with instances.

How we use it: Our entities don't have formal relationships declared between them, but we can infer a relationship graph from their definitions and mappings:

  • Entity A references entity B in its definition → definitional dependency
  • Entities A and B map to the same VSM system → structural co-occurrence
  • Entities A and B appear in the same chapter → contextual co-occurrence

From this inferred graph, we compute OntoQA metrics directly:

  • Relationship Richness tells us whether our concepts form a web of explanatory connections or just a flat list.
  • Attribute Richness maps to our schema sections — entities with more optional sections filled (Original Wording, Modern Interpretation) are richer.

Adaptation: The key modification is that relationship inference is an LLM-Eval step (pairwise: "does A's definition depend on or reference B?"), after which all OntoQA metrics are computed deterministically on the resulting graph.
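The two OntoQA ratios above reduce to simple counting once the inferred graph exists. The sketch below assumes an edge list tagged with a `kind` label and a per-entity count of filled optional sections; both structures and the choice to treat only `"subsumption"` edges as hierarchical are assumptions for illustration.

```python
# Hypothetical inferred edges: (source, target, kind).
# "subsumption" is treated as the only hierarchical kind in this sketch.
edges = [
    ("natural-rate", "ordinary-or-average-rate", "subsumption"),
    ("market-price", "effectual-demand", "definitional"),
    ("market-price", "natural-price", "co-occurrence"),
    ("division-of-labour", "market-extent", "definitional"),
]

# Hypothetical count of filled optional schema sections per entity.
filled_sections = {"market-price": 2, "natural-rate": 1, "division-of-labour": 0}


def relationship_richness(edges):
    """RR = non_hierarchical / total, as defined above."""
    if not edges:
        return 0.0
    non_hierarchical = sum(1 for _, _, kind in edges if kind != "subsumption")
    return non_hierarchical / len(edges)


def attribute_richness(filled_sections):
    """AR = total_attributes / total_concepts, using filled sections as attributes."""
    if not filled_sections:
        return 0.0
    return sum(filled_sections.values()) / len(filled_sections)


rr = relationship_richness(edges)
ar = attribute_richness(filled_sections)
print(rr, ar)
```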

3.5 Formal Concept Analysis (FCA)

Origin: Wille (1982). Applied to ontology auditing by Elhaj et al. (2008) for SNOMED CT completeness checking.

What it defines: A mathematical framework for deriving a concept lattice from a binary relation between objects and attributes. The lattice reveals:

  • Formal concepts: maximal sets of objects sharing the same attributes
  • Subconcept/superconcept relationships: the natural hierarchy
  • Missing concepts: attribute combinations with no corresponding object

How we use it: We construct a formal context (binary matrix):

  • Objects = our 85 entities
  • Attributes = economic domain, VSM system, source book, abstraction level (from LLM-Eval), key terms (extracted from definitions)

The concept lattice then reveals:

  • Coverage gaps (C2): Attribute combinations with no entity. E.g. if the cell {Distribution, S3} is empty, we lack control-layer concepts for distribution — a specific, actionable gap.
  • Redundancy (C1): Entities with identical attribute sets (same formal concept) are candidates for merging.
  • Granularity (C5): The lattice depth indicates how many meaningful levels of abstraction exist. A shallow lattice suggests missing intermediate concepts.

Adaptation: Classic FCA requires crisp binary attributes. Our domains and VSM mappings are already categorical, but abstraction level and key terms need LLM-Eval to produce. The lattice computation itself is deterministic (Python concepts library or equivalent). The FCA approach replaces the current "ask the LLM about coverage" with a structural computation that can identify specific gaps rather than vague recommendations.
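To make the lattice idea concrete, here is a small pure-Python sketch that enumerates the formal concepts of a toy context by closing the object intents under intersection. It does not use the `concepts` library mentioned above; the four entities and their attribute assignments are invented for illustration, and the naive enumeration is only suitable for small contexts.

```python
# Hypothetical formal context: entity slug -> set of attributes.
context = {
    "market-price": {"Exchange", "S2"},
    "effectual-demand": {"Exchange", "S2"},
    "division-of-labour": {"Production", "S1"},
    "wages-of-labour": {"Distribution", "S1"},
}


def extent(attrs):
    """All objects that carry every attribute in attrs."""
    return {obj for obj, own in context.items() if set(attrs) <= own}


def formal_concepts(context):
    """Enumerate (extent, intent) pairs by closing intents under intersection."""
    all_attrs = frozenset().union(*context.values())
    intents = {all_attrs}
    for own in context.values():
        intents |= {frozenset(i & own) for i in intents}
    return [(frozenset(extent(i)), i) for i in intents]


concepts = formal_concepts(context)

# Redundancy candidates (C1): objects sharing exactly the same attribute set.
by_profile = {}
for obj, own in context.items():
    by_profile.setdefault(frozenset(own), []).append(obj)
merge_candidates = [sorted(objs) for objs in by_profile.values() if len(objs) > 1]

# Coverage gap (C2): an attribute combination no entity carries.
gap = extent({"Distribution", "S3"})
print(len(concepts), merge_candidates, gap)
```

An empty extent for {Distribution, S3} corresponds to the empty-cell example given above.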

3.6 DSL Design Principles

Origin: Mernik et al. (2005) "When and How to Develop DSLs"; Karsai et al. (2014) "Design Guidelines for Domain-Specific Languages".

What they define: Quality criteria for a set of concepts that form a language for a specific domain:

  • Soundness: Every concept in the language corresponds to a real domain concern (no invented abstractions).
  • Completeness: The language can express everything needed for its intended tasks.
  • Laconicity: No unnecessary concepts — every concept earns its place.
  • Orthogonality: Concepts are independent; combining any two produces a meaningful result (no redundant combinations).

How we use it: Our entity set is effectively a domain-specific vocabulary for "explaining classical economics through VSM". DSL quality criteria translate directly:

  • Soundness → Validity (SEQUAL): every entity grounded in Smith's text
  • Completeness → Coverage (C2): can we answer the "competency questions" the infospace is meant to address?
  • Laconicity → Anti-redundancy (C1) + Indispensability (C5): would removing any entity lose explanatory power?
  • Orthogonality → Non-overlap (C1): entity definitions don't substantially duplicate each other

Adaptation: We operationalise DSL completeness through competency questions — a set of canonical questions the infospace should be able to answer (e.g. "How does the division of labour relate to market extent?", "What mechanisms regulate wages toward their natural rate?"). LLM-Eval tests whether the current entity set suffices to answer each question. Unanswerable questions identify specific completeness gaps.

Laconicity is operationalised as indispensability scoring: for each entity, LLM-Eval rates whether removing it would lose explanatory power. Low-scoring entities are candidates for merging or retirement.


4. Integration: Metric Definitions by Concern

C1: Semantic Overlap / Redundancy

Goal: Identify entities that substantially overlap in meaning and should be merged, distinguished, or retired.

Metrics:

| Metric | Type | Computation |
|---|---|---|
| similarity_matrix | Deterministic | Embed all entity definitions; compute NxN cosine similarity |
| high_similarity_pairs | Deterministic | Pairs with cosine > 0.80, sorted descending |
| confirmed_synonyms | LLM-Eval | For each high-similarity pair, LLM judges: "same concept" / "genuinely distinct" / "partial overlap" |
| redundancy_ratio | Deterministic | confirmed_synonyms / total_entities |
| intensional_conciseness | Deterministic | 1 - redundancy_ratio (from KG quality framework) |

Pipeline:

  1. Embed definitions (embedding API or local model)
  2. Compute cosine similarity matrix
  3. Filter pairs above threshold
  4. LLM pairwise judgment on filtered pairs only (avoids N² LLM calls)
  5. Aggregate into ratio and conciseness score

Output: output/metrics/redundancy-report.md + structured YAML with pair list, scores, and merge/retire recommendations.
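The C1 pipeline can be sketched end to end with toy data. The three-dimensional vectors stand in for real definition embeddings, and the `llm_verdicts` dictionary stands in for the LLM pairwise judgment; both are hypothetical, as are the helper names.

```python
import math
from itertools import combinations

# Toy vectors standing in for definition embeddings (assumption).
embeddings = {
    "natural-rate": [0.9, 0.1, 0.0],
    "ordinary-or-average-rate": [0.8, 0.2, 0.1],
    "division-of-labour": [0.0, 0.2, 0.9],
}
THRESHOLD = 0.80  # cosine threshold from the metrics table


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Steps 2-3: similarity for every unordered pair, filtered and sorted descending.
high_similarity_pairs = sorted(
    (
        (a, b, cosine(embeddings[a], embeddings[b]))
        for a, b in combinations(sorted(embeddings), 2)
        if cosine(embeddings[a], embeddings[b]) > THRESHOLD
    ),
    key=lambda item: item[2],
    reverse=True,
)

# Step 4: stand-in for the LLM pairwise judgment on filtered pairs only.
llm_verdicts = {("natural-rate", "ordinary-or-average-rate"): "same concept"}
confirmed_synonyms = [
    (a, b) for a, b, _ in high_similarity_pairs
    if llm_verdicts.get((a, b)) == "same concept"
]

# Step 5: aggregate.
redundancy_ratio = len(confirmed_synonyms) / len(embeddings)
intensional_conciseness = 1 - redundancy_ratio
print(high_similarity_pairs, redundancy_ratio, intensional_conciseness)
```

Only one of the three possible pairs passes the threshold here, so only one pair would be sent to the LLM.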

C2: Coverage Completeness

Goal: Identify domain areas and VSM systems that lack adequate representation in the entity set.

Metrics:

| Metric | Type | Computation |
|---|---|---|
| domain_vsm_matrix | Deterministic | Count entities per {economic_domain, VSM_system} cell |
| coverage_ratio | Deterministic | populated_cells / expected_cells |
| vsm_balance_entropy | Deterministic | Shannon entropy of entity distribution across VSM systems (higher = more balanced) |
| empty_cells | Deterministic | List of {domain, VSM_system} pairs with zero entities |
| competency_coverage | LLM-Eval | For each competency question, can it be answered with current entities? |
| fca_gap_concepts | Deterministic | Attribute combinations in the FCA lattice with no corresponding entity |

Pipeline:

  1. Parse entity metadata (domain, VSM mapping) from files on disk
  2. Build domain × VSM matrix; identify empty cells
  3. Build FCA formal context; compute lattice; extract gap concepts
  4. Define competency questions (initially hand-written, later LLM-generated from the source material)
  5. LLM-evaluate answerability of each question
  6. Aggregate into coverage ratio, entropy, and gap list

Output: output/metrics/coverage-report.md + YAML with matrix, gaps, and competency question results.
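The deterministic part of the coverage metrics (matrix, empty cells, coverage ratio, entropy) might look like the sketch below. The entity tuples and the reduced lists of domains and VSM systems are illustrative assumptions; in practice they would come from parsed entity metadata.

```python
import math
from collections import Counter
from itertools import product

# Hypothetical parsed metadata: (entity slug, economic domain, VSM system).
entities = [
    ("division-of-labour", "Production", "S1"),
    ("market-price", "Exchange", "S2"),
    ("effectual-demand", "Exchange", "S2"),
    ("wages-of-labour", "Distribution", "S1"),
]
domains = ["Production", "Exchange", "Distribution"]  # assumed subset
vsm_systems = ["S1", "S2", "S3"]  # assumed subset

# domain x VSM matrix as a counter keyed by (domain, system).
matrix = Counter((domain, system) for _, domain, system in entities)

expected_cells = list(product(domains, vsm_systems))
empty_cells = [cell for cell in expected_cells if matrix[cell] == 0]
coverage_ratio = (len(expected_cells) - len(empty_cells)) / len(expected_cells)

# Shannon entropy (in bits) of the entity distribution across VSM systems.
per_system = Counter(system for _, _, system in entities)
total = sum(per_system.values())
vsm_balance_entropy = -sum(
    (count / total) * math.log2(count / total) for count in per_system.values()
)
print(coverage_ratio, vsm_balance_entropy, empty_cells)
```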

C3: Structural Coherence

Goal: Determine whether the entities form a connected explanatory web or a fragmented collection of isolated concepts.

Metrics:

| Metric | Type | Computation |
|---|---|---|
| relationship_graph | LLM-Eval + Deterministic | Infer edges from definition cross-references (string matching) + LLM judgment for implicit references |
| connected_components | Deterministic | Number of connected components in the graph (target: 1) |
| graph_density | Deterministic | actual_edges / possible_edges |
| avg_degree | Deterministic | total_edges / total_entities |
| relationship_richness | Deterministic | OntoQA RR: non_hierarchical_edges / total_edges |
| modularity | Deterministic | Louvain modularity score (0.3-0.7 = meaningful structure; >0.8 = fragmentation) |
| bridge_concepts | Deterministic | Entities with highest betweenness centrality (connect clusters) |
| orphan_entities | Deterministic | Entities with degree 0 or 1 |
| cohesion_by_domain | Deterministic | Avg intra-domain edges per entity |
| coupling_across_domains | Deterministic | Inter-domain edges / total edges |

Pipeline:

  1. Extract explicit cross-references from definitions (entity name mentions in other definitions — string matching with slug normalisation)
  2. For entity pairs not caught by string matching, LLM-Eval: "Does A's definition depend on or reference B's concept?"
  3. Build directed graph
  4. Compute graph metrics (networkx or equivalent)
  5. Run community detection; compare detected communities to declared economic domains

Output: output/metrics/coherence-report.md + YAML with graph statistics, orphan list, bridge concepts, and community structure.
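A few of the simpler graph metrics can be sketched with the standard library alone. The pipeline names networkx for this step; the version below avoids that dependency only to stay self-contained. It treats edges as undirected for connectivity and density, which is an assumption of this sketch, and the entities and edges are invented.

```python
from collections import deque

# Hypothetical entity set and inferred edges.
entities = [
    "division-of-labour", "market-extent", "market-price",
    "effectual-demand", "rent-of-land",
]
edges = [
    ("division-of-labour", "market-extent"),
    ("market-extent", "market-price"),
    ("market-price", "effectual-demand"),
]


def build_adjacency(entities, edges):
    adjacency = {entity: set() for entity in entities}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def connected_components(adjacency):
    """Breadth-first search from every unvisited node."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in adjacency[node] - seen:
                seen.add(neighbour)
                queue.append(neighbour)
        components.append(component)
    return components


adjacency = build_adjacency(entities, edges)
n = len(adjacency)
edge_count = sum(len(nbrs) for nbrs in adjacency.values()) // 2
components = connected_components(adjacency)
graph_density = edge_count / (n * (n - 1) / 2)  # actual_edges / possible_edges
orphan_entities = sorted(e for e, nbrs in adjacency.items() if len(nbrs) <= 1)
print(len(components), graph_density, orphan_entities)
```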

C4: Definitional Consistency

Goal: Ensure entities are defined consistently, non-circularly, and without contradicting each other.

Metrics:

| Metric | Type | Computation |
|---|---|---|
| definitional_dependency_graph | Deterministic + LLM-Eval | Edges where A's definition uses B's concept |
| circular_definitions | Deterministic | Cycles of length ≤ 3 in the dependency graph |
| definition_depth | Deterministic | Longest dependency chain per entity before reaching a term not in the entity set |
| undefined_dependencies | Deterministic | Terms used in definitions that arguably should be entities but aren't |
| pairwise_consistency | LLM-Eval | For related entity pairs (sharing edges): "Do these definitions contradict each other?" |
| source_fidelity | LLM-Eval | "Does this definition accurately represent what Smith wrote in the cited passage?" |
| metaproperty_violations | LLM-Eval + Deterministic | OntoClean constraint checking after LLM classifies rigidity/identity |
| grounding_ratio | Deterministic | Fraction of entities traceable to primitives without cycles |

Pipeline:

  1. Build definitional dependency graph (same technique as C3, with edge direction made explicit: A depends on B means A's definition uses B, not vice versa)
  2. Detect cycles; flag short cycles
  3. Extract undefined terms (terms matching entity-name patterns that appear in definitions but have no corresponding entity file)
  4. LLM pairwise consistency check on directly-connected pairs
  5. LLM source fidelity check (compare definition to source chapter text)
  6. LLM OntoClean metaproperty classification; deterministic constraint checking

Output: output/metrics/consistency-report.md + YAML with cycle list, undefined terms, contradiction candidates, and metaproperty violations.
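The deterministic checks in this pipeline (short cycles, undefined dependencies, grounding ratio) can be sketched on a small dependency graph. The graph contents are invented, the helper names are hypothetical, and reading "grounded" as "no cycle reachable from the entity" is one interpretation of the grounding_ratio row, not a confirmed definition.

```python
# Hypothetical dependency graph: entity -> entities its definition uses.
graph = {
    "market-price": ["effectual-demand"],
    "effectual-demand": ["market-price"],
    "natural-price": ["wages", "profit", "rent"],
    "wages": [],
    "profit": ["stock"],  # "stock" has no entity file in this toy example
    "rent": [],
}


def short_cycles(graph, max_len=3):
    """Find simple cycles with at most max_len nodes, deduplicated by rotation."""
    cycles = set()

    def walk(start, node, path):
        for nxt in graph.get(node, ()):
            if nxt == start:
                pivot = path.index(min(path))  # canonical rotation
                cycles.add(tuple(path[pivot:] + path[:pivot]))
            elif nxt not in path and len(path) < max_len:
                walk(start, nxt, path + [nxt])

    for start in graph:
        walk(start, start, [start])
    return sorted(cycles)


def is_grounded(graph, node, stack=frozenset()):
    """True if no dependency chain starting at node runs into a cycle."""
    if node in stack:
        return False
    return all(is_grounded(graph, nxt, stack | {node}) for nxt in graph.get(node, ()))


circular_definitions = short_cycles(graph)
undefined_dependencies = sorted(
    {dep for deps in graph.values() for dep in deps if dep not in graph}
)
grounding_ratio = sum(is_grounded(graph, e) for e in graph) / len(graph)
print(circular_definitions, undefined_dependencies, grounding_ratio)
```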

C5: Granularity Balance

Goal: Ensure entities operate at comparable levels of abstraction within their respective domains and perspectives.

Metrics:

| Metric | Type | Computation |
|---|---|---|
| abstraction_classification | LLM-Eval | Classify each entity as theory-level / mechanism-level / observation-level |
| scope_score | LLM-Eval | Rate each entity 1-5 for generality (1 = very specific instance, 5 = broad theoretical principle) |
| abstraction_distribution | Deterministic | Count per level; compute entropy |
| scope_variance | Deterministic | Variance of scope scores within each domain |
| level_x_perspective_matrix | Deterministic | Cross-tabulation of abstraction level × economic domain |
| indispensability | LLM-Eval | "If removed, what explanatory power is lost?" (1-5) |
| dispensable_entities | Deterministic | Entities with indispensability score ≤ 2 |
| merge_candidates | LLM-Eval | Pairs where one is a sub-case of the other |

Pipeline:

  1. LLM-classify each entity: abstraction level, scope score, indispensability
  2. Build level × perspective matrix
  3. Compute distribution entropy and per-domain scope variance
  4. Flag outliers: entities whose scope score deviates > 1.5σ from their domain mean
  5. For outlier entities, LLM-Eval: "Should this be merged into a broader concept, or split into sub-concepts?"

Output: output/metrics/granularity-report.md + YAML with classifications, distribution, outliers, and merge/split recommendations.
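Step 4 (outlier flagging) is a small statistical check once scope scores exist. The sketch below uses the population standard deviation for σ, which is an assumption since the text does not say which estimator is meant. The scores and domain assignments are invented placeholders for LLM-Eval output.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical LLM-Eval scope scores: (entity slug, domain, score 1-5).
scope_scores = [
    ("market-price", "Exchange", 3),
    ("effectual-demand", "Exchange", 3),
    ("natural-price", "Exchange", 3),
    ("money-price", "Exchange", 3),
    ("invisible-hand", "Exchange", 5),
    ("division-of-labour", "Production", 4),
    ("pin-factory", "Production", 2),
]
SIGMA_FACTOR = 1.5  # deviation threshold from step 4


def scope_outliers(scope_scores, factor=SIGMA_FACTOR):
    """Flag entities whose score deviates more than factor * sigma from the domain mean."""
    by_domain = defaultdict(list)
    for slug, domain, score in scope_scores:
        by_domain[domain].append((slug, score))

    outliers = []
    for domain, items in by_domain.items():
        scores = [score for _, score in items]
        mu, sigma = mean(scores), pstdev(scores)
        if sigma == 0:
            continue  # all scores equal: nothing can be an outlier
        outliers += [slug for slug, score in items if abs(score - mu) > factor * sigma]
    return outliers


outliers = scope_outliers(scope_scores)
print(outliers)
```

With very few entities in a domain no score can exceed 1.5σ under the population estimator, so small domains never produce outliers in this sketch.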


5. Shared Infrastructure

Several concerns share underlying computations:

| Infrastructure | Used by | Build once |
|---|---|---|
| Definition embeddings (vector per entity) | C1, C3 | Embedding API call per entity |
| Relationship graph (entity → entity edges) | C3, C4 | String matching + LLM-Eval |
| FCA formal context (entity × attribute matrix) | C2, C5 | Metadata parsing + LLM classification |
| Entity metadata index (domain, VSM, chapter, sections) | C2, C5, C10 (schema compliance) | Deterministic markdown parsing |

These should be computed once per evaluation run and cached for use by all concern-specific metrics.
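One way to get compute-once behaviour for embeddings is to key the cache on a hash of the definition text, so an unchanged definition never triggers a second call. The `EmbeddingCache` class and the `fake_embed` stand-in below are hypothetical; a real run would plug in the embedding API or local model and persist the store to disk.

```python
import hashlib


def definition_key(text):
    """Stable cache key derived from the definition text itself."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class EmbeddingCache:
    """In-memory cache; only unseen or changed definitions reach embed_fn."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._store = {}
        self.misses = 0

    def get(self, text):
        key = definition_key(text)
        if key not in self._store:
            self._store[key] = self._embed_fn(text)
            self.misses += 1
        return self._store[key]


def fake_embed(text):
    # Stand-in for a real embedding call (assumption for this sketch).
    return [float(len(text)), float(text.count(" "))]


cache = EmbeddingCache(fake_embed)
first = cache.get("The price at which a commodity is actually sold.")
second = cache.get("The price at which a commodity is actually sold.")
third = cache.get("The separation of work into distinct operations.")
print(cache.misses)
```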


6. Evaluation Workflow

A full collection-level evaluation run:

process_chapters.py --evaluate-collection --provider <provider>
  1. Parse — deterministic metadata extraction from all entity files
  2. Embed — compute definition embeddings (cached; only new/changed entities need fresh embeddings)
  3. Infer — LLM-Eval for relationship edges, metaproperties, abstraction levels, pairwise judgments (batched to minimise LLM calls)
  4. Compute — deterministic graph metrics, FCA lattice, coverage matrix, similarity matrix, cycle detection
  5. Aggregate — combine per-entity and per-pair scores into collection-level metrics
  6. Report — write per-concern markdown reports + unified metrics.yaml
  7. Append — add timestamped snapshot to metrics-history.yaml

Incremental mode (--evaluate-collection --chapter <id>) re-evaluates only the entities introduced or modified by that chapter, plus any pairwise checks involving those entities.
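The pair selection for incremental mode can be sketched as a filter over all unordered entity pairs. The function name and the entity slugs are hypothetical; the sketch only illustrates which pairwise checks would be re-run when a chapter changes a given set of entities.

```python
from itertools import combinations


def pairs_to_recheck(all_entities, changed_entities):
    """Unordered pairs in which at least one member was introduced or modified."""
    changed = set(changed_entities)
    return [
        (a, b)
        for a, b in combinations(sorted(all_entities), 2)
        if a in changed or b in changed
    ]


all_entities = ["division-of-labour", "market-price", "natural-price", "wages-of-labour"]
changed_entities = ["natural-price"]  # hypothetical result of re-processing one chapter

recheck = pairs_to_recheck(all_entities, changed_entities)
print(len(recheck), "of", len(list(combinations(all_entities, 2))), "pairs")
```

For N entities and one changed entity this selects N - 1 pairs instead of N(N - 1)/2.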


7. References

  • Lindland, O.I., Sindre, G. & Sølvberg, A. (1994). "Understanding Quality in Conceptual Modeling." IEEE Software 11(2), 42-49. → SEQUAL framework: validity and completeness dimensions.

  • Guarino, N. & Welty, C.A. (2004). "An Overview of OntoClean." In Handbook on Ontologies, Springer, 151-171. → Metaproperty analysis: rigidity, identity, unity, dependence.

  • Poveda-Villalón, M., Gómez-Pérez, A. & Suárez-Figueroa, M.C. (2014). "OOPS! (OntOlogy Pitfall Scanner!): An On-line Tool for Ontology Evaluation." IJSWIS 10(2), 7-34. → Pitfall catalogue: 41 anti-patterns for ontology design.

  • Tartir, S. & Arpinar, I.B. (2007). "Ontology Evaluation and Ranking using OntoQA." ICSC 2007, IEEE, 185-192. → Schema metrics: relationship richness, attribute richness.

  • Wille, R. (1982). "Restructuring Lattice Theory." In Ordered Sets, Reidel, 445-470. → Formal Concept Analysis: concept lattices from binary contexts.

  • Elhaj, H. et al. (2008). "Auditing SNOMED CT with Formal Concept Analysis." AMIA Annual Symposium, PMC2605587. → FCA for ontology completeness auditing.

  • Keet, C.M. (2008). A Formal Theory of Granularity. PhD thesis, Free University of Bozen-Bolzano. → Granularity levels and perspectives for ontology design.

  • Mernik, M., Heering, J. & Sloane, A.M. (2005). "When and How to Develop Domain-Specific Languages." ACM Computing Surveys 37(4), 316-344. → DSL design: soundness, completeness, laconicity.

  • Karsai, G. et al. (2014). "Design Guidelines for Domain Specific Languages." arXiv:1409.2378. → Orthogonality, necessary-and-sufficient principle.

  • Xue, B. & Zou, L. (2022). "Knowledge Graph Quality Management: A Comprehensive Survey." IEEE TKDE 35(5), 4969-4988. → KG quality dimensions: conciseness, consistency, completeness.