tegwick 4ce856d4d0 docs: metrics methodology, collection-level tasks, and infospace tooling roadmap

Add METRICS-METHODOLOGY.md documenting the theoretical frameworks
(SEQUAL, OntoClean, OOPS!, OntoQA, FCA, DSL principles) adapted for
two-layer evaluation (LLM-Eval + deterministic aggregation) across
five collection concerns: redundancy, coverage, coherence, consistency,
and granularity balance.

Extend INFRA-TASKS.md with assignment assessment (tasks 4-7),
per-concept metrics (tasks 8-12), and collection-level metrics
(tasks 13-19).

Add roadmap/infospace-tooling/PLAN.md defining terminology (infospace,
topic, discipline, entity, evaluation, viability) and a three-stage
implementation plan: Stage 1 platform additions, Stage 2 infospace
tooling layer, Stage 3 example revision.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-02-18 23:53:21 +01:00


Collection-Level Metrics Methodology

How we evaluate the quality of the infospace as a collection of interrelated concepts, beyond the quality of individual entities.

This document describes the theoretical frameworks, drawn from ontology engineering, formal concept analysis, semiotic quality theory, and DSL design, and how each is adapted to work within MarkiTect's two-layer evaluation model (LLM-Eval + deterministic aggregation).

1. The Two-Layer Model

Every metric in this methodology decomposes into two layers:

| Layer | What it does | How it runs |
|---|---|---|
| LLM-Eval | Qualitative judgment: "Are these two concepts the same?", "Is this definition grounded in the source?" | Prompt template → LLM → structured YAML output |
| Deterministic | Quantitative aggregation: cosine similarity, graph connectivity, coverage counting, cycle detection | Python code in `process_chapters.py` or dedicated `metrics.py` |

The LLM-Eval layer produces per-entity or per-pair structured scores. The deterministic layer aggregates these into collection-level metrics, persisted as machine-readable YAML alongside human-readable markdown reports.
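As a rough illustration of the hand-off between the two layers, the sketch below takes per-pair records of the kind the LLM-Eval layer might emit (already parsed from YAML) and aggregates them deterministically. The record fields (`pair`, `verdict`) and the function name `aggregate_verdicts` are assumptions made for this example, not the project's actual schema.

```python
from collections import Counter

# Hypothetical LLM-Eval output, already parsed from YAML into dicts.
# Field names and entity slugs are illustrative assumptions.
llm_eval_records = [
    {"pair": ("natural-price", "market-price"), "verdict": "genuinely distinct"},
    {"pair": ("natural-rate", "ordinary-rate"), "verdict": "same concept"},
    {"pair": ("wages", "profit"), "verdict": "partial overlap"},
]


def aggregate_verdicts(records):
    """Deterministic layer: count verdicts and derive a simple ratio."""
    counts = Counter(record["verdict"] for record in records)
    total = len(records)
    same_ratio = counts["same concept"] / total if total else 0.0
    return {"counts": dict(counts), "same_concept_ratio": same_ratio}


summary = aggregate_verdicts(llm_eval_records)
print(summary)
```

The point of the split is that only the first structure depends on a model call; everything computed from it is reproducible.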

Per-concept quality metrics (definition precision, source grounding, VSM relevance — see INFRA-TASKS 8-12) operate at the individual entity level. This document covers the five collection-level concerns that assess how the entities work together as an explanatory system.

2. Five Collection-Level Concerns

Overview

| # | Concern | Question | Primary framework |
|---|---|---|---|
| C1 | Semantic Overlap | Are there redundant concepts? | OOPS! P2, embedding similarity |
| C2 | Coverage Completeness | Does the concept set cover the domain? | SEQUAL, FCA |
| C3 | Structural Coherence | Do concepts form a connected explanatory graph? | OntoQA, graph theory |
| C4 | Definitional Consistency | Are concepts defined consistently and non-circularly? | OntoClean, OOPS! P24 |
| C5 | Granularity Balance | Are concepts at comparable levels of abstraction? | Granularity theory, DSL laconicity |

3. Theoretical Frameworks

3.1 SEQUAL (Semiotic Quality Framework)

Origin: Lindland, Sindre & Sølvberg (1994), extended by Krogstie et al.

What it defines: Quality of a conceptual model as the correspondence between three worlds — the domain (what exists), the model (what we captured), and the audience's interpretation (what they understand).

Two key dimensions of semantic quality:

Validity — everything in the model corresponds to something real in the domain. No invented concepts.
Completeness — everything relevant in the domain is represented in the model. No missing concepts.

How we use it: SEQUAL frames our entire metrics approach. Every collection-level metric maps to one of these dimensions:

| SEQUAL dimension | Our concerns |
|---|---|
| Validity | C1 (redundancy reduces validity: duplicate concepts don't correspond to distinct domain facts), C4 (consistency: contradictory definitions can't both be valid) |
| Completeness | C2 (coverage: are all needed concepts present?), C5 (granularity: missing levels of abstraction are completeness gaps) |
| Both | C3 (coherence: disconnected concepts suggest either missing bridging concepts [completeness] or misplaced concepts [validity]) |

Adaptation: SEQUAL was designed for formal models evaluated by human experts. We replace human judgment with LLM-Eval (for validity checks like "does this concept correspond to something Smith actually described?") and deterministic counting (for completeness checks like "which VSM systems lack entity mappings?").

3.2 OntoClean

Origin: Guarino & Welty (2004).

What it defines: A methodology for validating taxonomic relationships by assigning metaproperties to each concept:

Rigidity — Is the property essential to all its instances? (e.g. "market" is rigid; "effectual demander" is anti-rigid — an agent can stop being an effectual demander)
Identity — Does the concept carry an identity criterion? (e.g. "division of labour" can be identified by its three causal mechanisms)
Unity — Are all instances of this concept whole in the same way?
Dependence — Does the concept require another concept to exist? (e.g. "market price" depends on "effectual demand")

Constraint: A rigid concept cannot be subsumed by an anti-rigid one. Violations indicate structural confusion.

How we use it: We do not have a formal taxonomy, but our flat entity set implicitly contains subsumption relationships (e.g. "natural rate" subsumes "ordinary-or-average rate"). OntoClean metaproperties help detect:

Granularity mismatches (C5): A rigid concept at the same level as an anti-rigid one suggests different abstraction levels are mixed.
Definitional consistency (C4): If entity A depends on entity B per OntoClean, but B's definition doesn't acknowledge A, the definitions are inconsistent.
Redundancy (C1): Two entities with identical metaproperty profiles and overlapping definitions are candidates for merging.

Adaptation: Instead of manual metaproperty assignment, we use LLM-Eval to classify each entity's rigidity, identity criterion, and dependencies. The constraint checking is then deterministic.
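The deterministic half of this check can be sketched as follows, assuming the LLM-Eval step has already produced a rigidity label per entity and a list of implied subsumption pairs. The data layout, the `agent` entity, and the function name `rigidity_violations` are hypothetical and only serve the illustration.

```python
# Hypothetical LLM-Eval classifications, using OntoClean's rigidity labels.
rigidity = {
    "market": "rigid",
    "effectual-demander": "anti-rigid",
    "agent": "rigid",  # illustrative entity, not taken from the entity set
}

# (child, parent) pairs meaning "parent subsumes child"; illustrative only.
subsumptions = [
    ("effectual-demander", "agent"),  # anti-rigid under rigid: allowed
    ("agent", "effectual-demander"),  # rigid under anti-rigid: violation
]


def rigidity_violations(rigidity, subsumptions):
    """Return pairs where a rigid concept is subsumed by an anti-rigid one."""
    return [
        (child, parent)
        for child, parent in subsumptions
        if rigidity.get(child) == "rigid" and rigidity.get(parent) == "anti-rigid"
    ]


violations = rigidity_violations(rigidity, subsumptions)
print(violations)
```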

3.3 OOPS! (Ontology Pitfall Scanner)

Origin: Poveda-Villalón et al. (2014). Catalogue of 41 common ontology design pitfalls.

What it defines: Concrete, testable anti-patterns. The pitfalls most relevant to our infospace:

| Pitfall | Description | Our concern |
|---|---|---|
| P2 | Synonymous classes: different names, same meaning | C1 (redundancy) |
| P4 | Unconnected ontology elements | C3 (coherence) |
| P7 | Merging different concepts in the same class | C5 (granularity: too coarse) |
| P10 | Missing disjointness axioms | C1 (how do we know two concepts don't overlap?) |
| P11 | Missing domain or range | C4 (consistency) |
| P13 | Missing inverse relationships (inverse not explicitly declared) | C3 |
| P24 | Recursive/circular definition | C4 (consistency) |
| P25 | Inverse of itself | C4 |

How we use it: OOPS! pitfalls become a checklist for LLM-Eval prompts. Rather than running a formal OWL scanner, we ask the LLM to check for each pitfall pattern:

"Are entities A and B synonymous?" (P2)
"Does entity A's definition reference itself?" (P24)
"Is entity A actually two distinct concepts merged together?" (P7)

The deterministic layer counts pitfall occurrences and tracks them over time.

Adaptation: We select the subset of OOPS! pitfalls applicable to semi-formal markdown-based ontologies (no OWL axioms) and implement each as an LLM-Eval prompt pattern rather than a formal reasoner check.

3.4 OntoQA (Metric-Based Ontology Quality Analysis)

Origin: Tartir & Arpinar (2007).

What it defines: Quantitative schema-level and instance-level metrics:

Relationship Richness (RR): Proportion of non-taxonomic (lateral) relationships to total relationships. RR = non_hierarchical / total. Low RR = mere taxonomy. High RR = rich cross-cutting connections.
Attribute Richness (AR): Average number of attributes per concept. AR = total_attributes / total_concepts.
Inheritance Richness (IR): Average subclasses per class — measures how knowledge distributes across the hierarchy.
Class Richness (CR): Proportion of classes with instances.

How we use it: Our entities don't have formal relationships declared between them, but we can infer a relationship graph from their definitions and mappings:

Entity A references entity B in its definition → definitional dependency
Entities A and B map to the same VSM system → structural co-occurrence
Entities A and B appear in the same chapter → contextual co-occurrence

From this inferred graph, we compute OntoQA metrics directly:

Relationship Richness tells us whether our concepts form a web of explanatory connections or just a flat list.
Attribute Richness maps to our schema sections — entities with more optional sections filled (Original Wording, Modern Interpretation) are richer.

Adaptation: The key modification is that relationship inference is an LLM-Eval step (pairwise: "does A's definition depend on or reference B?"), after which all OntoQA metrics are computed deterministically on the resulting graph.
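Once the edges exist, the two OntoQA metrics used here reduce to simple ratios. The sketch below assumes each inferred edge carries a kind label and that attribute richness is approximated by the number of filled schema sections per entity; the edge kinds, section counts, and function names are assumptions for illustration.

```python
# Inferred edges as (source, target, kind). The kind labels are assumptions:
# "hierarchical" marks implied subsumption, everything else is lateral.
edges = [
    ("natural-price", "market-price", "definitional"),
    ("market-price", "effectual-demand", "definitional"),
    ("natural-rate", "ordinary-rate", "hierarchical"),
    ("wages", "profit", "co-occurrence"),
]

# Number of schema sections filled per entity (illustrative values).
filled_sections = {
    "natural-price": 4,
    "market-price": 3,
    "wages": 2,
    "profit": 3,
}


def relationship_richness(edges):
    """RR = non_hierarchical / total, as defined above."""
    total = len(edges)
    lateral = sum(1 for _, _, kind in edges if kind != "hierarchical")
    return lateral / total if total else 0.0


def attribute_richness(filled_sections):
    """AR = total_attributes / total_concepts."""
    if not filled_sections:
        return 0.0
    return sum(filled_sections.values()) / len(filled_sections)


rr = relationship_richness(edges)
ar = attribute_richness(filled_sections)
print(rr, ar)
```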

3.5 Formal Concept Analysis (FCA)

Origin: Wille (1982). Applied to ontology auditing by Elhaj et al. (2008) for SNOMED CT completeness checking.

What it defines: A mathematical framework for deriving a concept lattice from a binary relation between objects and attributes. The lattice reveals:

Formal concepts: maximal sets of objects sharing the same attributes
Subconcept/superconcept relationships: the natural hierarchy
Missing concepts: attribute combinations with no corresponding object

How we use it: We construct a formal context (binary matrix):

Objects = our 85 entities
Attributes = economic domain, VSM system, source book, abstraction level (from LLM-Eval), key terms (extracted from definitions)

The concept lattice then reveals:

Coverage gaps (C2): Attribute combinations with no entity. E.g. if the cell {Distribution, S3} is empty, we lack control-layer concepts for distribution — a specific, actionable gap.
Redundancy (C1): Entities with identical attribute sets (same formal concept) are candidates for merging.
Granularity (C5): The lattice depth indicates how many meaningful levels of abstraction exist. A shallow lattice suggests missing intermediate concepts.

Adaptation: Classic FCA requires crisp binary attributes. Our domains and VSM mappings are already categorical, but abstraction level and key terms need LLM-Eval to produce. The lattice computation itself is deterministic (Python concepts library or equivalent). The FCA approach replaces the current "ask the LLM about coverage" with a structural computation that can identify specific gaps rather than vague recommendations.
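To make the lattice idea concrete, here is a minimal pure-Python sketch on a toy context of four entities. It enumerates formal concepts by brute force, which is only workable for tiny inputs; a dedicated FCA library would be the realistic choice for the full entity set. The entity slugs, domain names, and VSM labels are illustrative assumptions.

```python
from itertools import combinations, product

# Toy formal context: entity -> attribute set (illustrative values only).
context = {
    "division-of-labour": {"Production", "S1"},
    "market-extent": {"Exchange", "S2"},
    "natural-price": {"Exchange", "S3"},
    "market-price": {"Exchange", "S3"},
}
domains = ["Production", "Exchange", "Distribution"]
systems = ["S1", "S2", "S3"]


def intent(objects):
    """Attributes shared by every object in `objects`."""
    if not objects:
        return set().union(*context.values())
    return set.intersection(*(context[o] for o in objects))


def extent(attributes):
    """Objects that carry every attribute in `attributes`."""
    return {o for o, attrs in context.items() if attributes <= attrs}


def formal_concepts():
    """Brute force: close every object subset into an (extent, intent) pair."""
    found = set()
    names = sorted(context)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            shared = intent(subset)
            found.add((frozenset(extent(shared)), frozenset(shared)))
    return found


def identical_profiles():
    """Entities with exactly the same attribute set (merge candidates, C1)."""
    groups = {}
    for name, attrs in context.items():
        groups.setdefault(frozenset(attrs), []).append(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]


concepts = formal_concepts()
duplicates = identical_profiles()
# Coverage gaps (C2): {domain, VSM system} combinations with no entity.
gaps = [(d, s) for d, s in product(domains, systems) if not extent({d, s})]
print(len(concepts), duplicates, gaps)
```

In this toy context the `{Distribution, S3}` cell shows up among the gaps, mirroring the example given above.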

3.6 DSL Design Principles

Origin: Mernik et al. (2005) "When and How to Develop DSLs"; Karsai et al. (2014) "Design Guidelines for Domain-Specific Languages".

What they define: Quality criteria for a set of concepts that form a language for a specific domain:

Soundness: Every concept in the language corresponds to a real domain concern (no invented abstractions).
Completeness: The language can express everything needed for its intended tasks.
Laconicity: No unnecessary concepts — every concept earns its place.
Orthogonality: Concepts are independent; combining any two produces a meaningful result (no redundant combinations).

How we use it: Our entity set is effectively a domain-specific vocabulary for "explaining classical economics through VSM". DSL quality criteria translate directly:

Soundness → Validity (SEQUAL): every entity grounded in Smith's text
Completeness → Coverage (C2): can we answer the "competency questions" the infospace is meant to address?
Laconicity → Anti-redundancy (C1) + Indispensability (C5): would removing any entity lose explanatory power?
Orthogonality → Non-overlap (C1): entity definitions don't substantially duplicate each other

Adaptation: We operationalise DSL completeness through competency questions — a set of canonical questions the infospace should be able to answer (e.g. "How does the division of labour relate to market extent?", "What mechanisms regulate wages toward their natural rate?"). LLM-Eval tests whether the current entity set suffices to answer each question. Unanswerable questions identify specific completeness gaps.

Laconicity is operationalised as indispensability scoring: for each entity, LLM-Eval rates whether removing it would lose explanatory power. Low-scoring entities are candidates for merging or retirement.

4. Integration: Metric Definitions by Concern

C1: Semantic Overlap / Redundancy

Goal: Identify entities that substantially overlap in meaning and should be merged, distinguished, or retired.

Metrics:

| Metric | Type | Computation |
|---|---|---|
| `similarity_matrix` | Deterministic | Embed all entity definitions; compute NxN cosine similarity |
| `high_similarity_pairs` | Deterministic | Pairs with cosine > 0.80, sorted descending |
| `confirmed_synonyms` | LLM-Eval | For each high-similarity pair, LLM judges: "same concept" / "genuinely distinct" / "partial overlap" |
| `redundancy_ratio` | Deterministic | `confirmed_synonyms / total_entities` |
| `intensional_conciseness` | Deterministic | `1 - redundancy_ratio` (from KG quality framework) |

Pipeline:

Embed definitions (embedding API or local model)
Compute cosine similarity matrix
Filter pairs above threshold
LLM pairwise judgment on filtered pairs only (avoids N² LLM calls)
Aggregate into ratio and conciseness score

Output: output/metrics/redundancy-report.md + structured YAML with pair list, scores, and merge/retire recommendations.
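The deterministic steps of this pipeline can be sketched with the standard library alone. The three-dimensional vectors below stand in for real definition embeddings, and the function names are assumptions; only the 0.80 threshold and the two ratio formulas come from the metric table above.

```python
import math
from itertools import combinations

# Toy vectors standing in for real definition embeddings (assumption).
embeddings = {
    "natural-rate": [0.9, 0.1, 0.0],
    "ordinary-rate": [0.85, 0.15, 0.05],
    "division-of-labour": [0.0, 0.2, 0.95],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def high_similarity_pairs(embeddings, threshold=0.80):
    """Pairs above the threshold, sorted by similarity, highest first."""
    scored = [
        (a, b, cosine(embeddings[a], embeddings[b]))
        for a, b in combinations(sorted(embeddings), 2)
    ]
    return sorted(
        (p for p in scored if p[2] > threshold), key=lambda p: p[2], reverse=True
    )


def redundancy_scores(confirmed_synonyms, total_entities):
    """redundancy_ratio and intensional_conciseness as defined in the table."""
    ratio = len(confirmed_synonyms) / total_entities if total_entities else 0.0
    return ratio, 1 - ratio


candidates = high_similarity_pairs(embeddings)
# In the real pipeline the LLM would confirm or reject each candidate;
# here every candidate is treated as confirmed purely for illustration.
ratio, conciseness = redundancy_scores(candidates, len(embeddings))
print(candidates, ratio, conciseness)
```

Filtering before the LLM step is what keeps the number of pairwise judgments small relative to the full N² comparison.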

C2: Coverage Completeness

Goal: Identify domain areas and VSM systems that lack adequate representation in the entity set.

Metrics:

| Metric | Type | Computation |
|---|---|---|
| `domain_vsm_matrix` | Deterministic | Count entities per {economic_domain, VSM_system} cell |
| `coverage_ratio` | Deterministic | `populated_cells / expected_cells` |
| `vsm_balance_entropy` | Deterministic | Shannon entropy of entity distribution across VSM systems (higher = more balanced) |
| `empty_cells` | Deterministic | List of {domain, VSM_system} pairs with zero entities |
| `competency_coverage` | LLM-Eval | For each competency question, can it be answered with current entities? |
| `fca_gap_concepts` | Deterministic | Attribute combinations in the FCA lattice with no corresponding entity |

Pipeline:

Parse entity metadata (domain, VSM mapping) from files on disk
Build domain × VSM matrix; identify empty cells
Build FCA formal context; compute lattice; extract gap concepts
Define competency questions (initially hand-written, later LLM-generated from the source material)
LLM-evaluate answerability of each question
Aggregate into coverage ratio, entropy, and gap list

Output: output/metrics/coverage-report.md + YAML with matrix, gaps, and competency question results.
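A minimal sketch of the matrix, ratio, and entropy steps follows. It assumes entity metadata has already been parsed into dicts with `domain` and `vsm` keys; those key names, the sample entities, and the three-by-three grid are assumptions for the example.

```python
import math
from collections import Counter
from itertools import product

# Parsed entity metadata (illustrative slugs and values).
entities = [
    {"slug": "division-of-labour", "domain": "Production", "vsm": "S1"},
    {"slug": "market-extent", "domain": "Exchange", "vsm": "S1"},
    {"slug": "natural-price", "domain": "Exchange", "vsm": "S3"},
    {"slug": "rent", "domain": "Distribution", "vsm": "S2"},
]
domains = ["Production", "Exchange", "Distribution"]
systems = ["S1", "S2", "S3"]


def coverage(entities, domains, systems):
    """Build the domain x VSM matrix and derive empty cells and the ratio."""
    matrix = Counter((e["domain"], e["vsm"]) for e in entities)
    expected = list(product(domains, systems))
    empty = [cell for cell in expected if matrix[cell] == 0]
    ratio = (len(expected) - len(empty)) / len(expected) if expected else 0.0
    return matrix, empty, ratio


def vsm_balance_entropy(entities):
    """Shannon entropy (in bits) of the entity distribution over VSM systems."""
    counts = Counter(e["vsm"] for e in entities)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


matrix, empty_cells, coverage_ratio = coverage(entities, domains, systems)
entropy = vsm_balance_entropy(entities)
print(coverage_ratio, entropy, empty_cells)
```

Systems with no entities contribute nothing to the entropy sum, so a low entropy and a long `empty_cells` list tend to appear together.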

C3: Structural Coherence

Goal: Determine whether the entities form a connected explanatory web or a fragmented collection of isolated concepts.

Metrics:

| Metric | Type | Computation |
|---|---|---|
| `relationship_graph` | LLM-Eval + Deterministic | Infer edges from definition cross-references (string matching) + LLM judgment for implicit references |
| `connected_components` | Deterministic | Number of weakly connected components in the graph (target: 1) |
| `graph_density` | Deterministic | `actual_edges / possible_edges` |
| `avg_degree` | Deterministic | `total_edges / total_entities` (mean out-degree, since the graph is directed) |
| `relationship_richness` | Deterministic | OntoQA RR: `non_hierarchical_edges / total_edges` |
| `modularity` | Deterministic | Louvain modularity score (0.3-0.7 = meaningful structure; >0.8 = fragmentation) |
| `bridge_concepts` | Deterministic | Entities with highest betweenness centrality (connect clusters) |
| `orphan_entities` | Deterministic | Entities with degree 0 or 1 |
| `cohesion_by_domain` | Deterministic | Avg intra-domain edges per entity |
| `coupling_across_domains` | Deterministic | Inter-domain edges / total edges |

Pipeline:

Extract explicit cross-references from definitions (entity name mentions in other definitions — string matching with slug normalisation)
For entity pairs not caught by string matching, LLM-Eval: "Does A's definition depend on or reference B's concept?"
Build directed graph
Compute graph metrics (networkx or equivalent)
Run community detection; compare detected communities to declared economic domains

Output: output/metrics/coherence-report.md + YAML with graph statistics, orphan list, bridge concepts, and community structure.
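The simpler graph statistics need no graph library at all. The sketch below computes weakly connected components, density, average degree, and orphans for a small directed edge list; betweenness centrality and community detection are left out because they are better delegated to a library such as networkx. Entity slugs and the function name `graph_metrics` are illustrative assumptions.

```python
from collections import defaultdict, deque

nodes = [
    "division-of-labour", "market-extent", "natural-price",
    "market-price", "effectual-demand", "rent",
]
# Directed edges (A, B): A's definition references B (illustrative).
edges = [
    ("division-of-labour", "market-extent"),
    ("market-price", "natural-price"),
    ("market-price", "effectual-demand"),
    ("natural-price", "effectual-demand"),
]


def graph_metrics(nodes, edges):
    """Weak connectivity, density, average degree and orphans."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)  # ignore direction for weak connectivity

    seen, components = set(), 0
    for start in nodes:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        queue = deque([start])
        while queue:  # breadth-first search over one component
            current = queue.popleft()
            for nxt in neighbours[current]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)

    n = len(nodes)
    possible = n * (n - 1)  # ordered pairs, because edges are directed
    return {
        "connected_components": components,
        "graph_density": len(edges) / possible if possible else 0.0,
        "avg_degree": len(edges) / n if n else 0.0,
        "orphan_entities": sorted(x for x in nodes if len(neighbours[x]) <= 1),
    }


metrics = graph_metrics(nodes, edges)
print(metrics)
```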

C4: Definitional Consistency

Goal: Ensure entities are defined consistently, non-circularly, and without contradicting each other.

Metrics:

| Metric | Type | Computation |
|---|---|---|
| `definitional_dependency_graph` | Deterministic + LLM-Eval | Edges where A's definition uses B's concept |
| `circular_definitions` | Deterministic | Cycles of length ≤ 3 in the dependency graph |
| `definition_depth` | Deterministic | Longest dependency chain per entity before reaching a term not in the entity set |
| `undefined_dependencies` | Deterministic | Terms used in definitions that arguably should be entities but aren't |
| `pairwise_consistency` | LLM-Eval | For related entity pairs (sharing edges): "Do these definitions contradict each other?" |
| `source_fidelity` | LLM-Eval | "Does this definition accurately represent what Smith wrote in the cited passage?" |
| `metaproperty_violations` | LLM-Eval + Deterministic | OntoClean constraint checking after LLM classifies rigidity/identity |
| `grounding_ratio` | Deterministic | Fraction of entities traceable to primitives without cycles |

Pipeline:

Build definitional dependency graph (same technique as C3; edge direction matters here: A depends on B means A's definition uses B, not vice versa)
Detect cycles; flag short cycles
Extract undefined terms (terms matching entity-name patterns that appear in definitions but have no corresponding entity file)
LLM pairwise consistency check on directly-connected pairs
LLM source fidelity check (compare definition to source chapter text)
LLM OntoClean metaproperty classification; deterministic constraint checking

Output: output/metrics/consistency-report.md + YAML with cycle list, undefined terms, contradiction candidates, and metaproperty violations.
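Short-cycle detection is the most mechanical part of this pipeline. The sketch below finds cycles of up to three entities with a depth-limited walk and reports each cycle once, rotated to start at its alphabetically smallest member. The dependency data and the function name `short_cycles` are assumptions for illustration.

```python
# A -> [B, ...] means A's definition uses B (illustrative data).
depends_on = {
    "market-price": ["effectual-demand"],
    "effectual-demand": ["natural-price"],
    "natural-price": ["market-price"],
    "wages": ["labour"],   # "labour" has no entry: an undefined dependency
    "value": ["value"],    # self-referential definition
}


def short_cycles(depends_on, max_len=3):
    """Return all dependency cycles containing at most `max_len` entities."""
    cycles = set()

    def walk(start, node, path):
        for nxt in depends_on.get(node, ()):
            if nxt == start:
                # Rotate so the smallest slug comes first; this makes the
                # same cycle found from different start nodes identical.
                pivot = path.index(min(path))
                cycles.add(tuple(path[pivot:] + path[:pivot]))
            elif nxt not in path and len(path) < max_len:
                walk(start, nxt, path + [nxt])

    for start in depends_on:
        walk(start, start, [start])
    return sorted(cycles)


circular_definitions = short_cycles(depends_on)
print(circular_definitions)
```

Raising `max_len` widens the search at the cost of more path exploration; the limit of three matches the `circular_definitions` metric above.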

C5: Granularity Balance

Goal: Ensure entities operate at comparable levels of abstraction within their respective domains and perspectives.

Metrics:

| Metric | Type | Computation |
|---|---|---|
| `abstraction_classification` | LLM-Eval | Classify each entity as theory-level / mechanism-level / observation-level |
| `scope_score` | LLM-Eval | Rate each entity 1-5 for generality (1 = very specific instance, 5 = broad theoretical principle) |
| `abstraction_distribution` | Deterministic | Count per level; compute entropy |
| `scope_variance` | Deterministic | Variance of scope scores within each domain |
| `level_x_perspective_matrix` | Deterministic | Cross-tabulation of abstraction level × economic domain |
| `indispensability` | LLM-Eval | "If removed, what explanatory power is lost?" (1-5) |
| `dispensable_entities` | Deterministic | Entities with indispensability score ≤ 2 |
| `merge_candidates` | LLM-Eval | Pairs where one is a sub-case of the other |
`merge_candidates`	LLM-Eval	Pairs where one is a sub-case of the other

Pipeline:

LLM-classify each entity: abstraction level, scope score, indispensability
Build level × perspective matrix
Compute distribution entropy and per-domain scope variance
Flag outliers: entities whose scope score deviates > 1.5σ from their domain mean
For outlier entities, LLM-Eval: "Should this be merged into a broader concept, or split into sub-concepts?"

Output: output/metrics/granularity-report.md + YAML with classifications, distribution, outliers, and merge/split recommendations.
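The outlier step can be sketched as below, assuming each entity already carries an LLM-assigned `scope` score. The sketch uses the population standard deviation; the text does not say whether population or sample deviation is intended, so that choice, the sample data, and the function name `scope_outliers` are all assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Illustrative entities with LLM-assigned scope scores (1-5).
entities = [
    {"slug": "natural-price", "domain": "Exchange", "scope": 3},
    {"slug": "market-price", "domain": "Exchange", "scope": 3},
    {"slug": "effectual-demand", "domain": "Exchange", "scope": 3},
    {"slug": "higgling", "domain": "Exchange", "scope": 3},
    {"slug": "invisible-hand", "domain": "Exchange", "scope": 5},
    {"slug": "wages", "domain": "Distribution", "scope": 2},
    {"slug": "profit", "domain": "Distribution", "scope": 3},
    {"slug": "rent", "domain": "Distribution", "scope": 4},
]


def scope_outliers(entities, k=1.5):
    """Entities whose scope deviates more than k sigma from their domain mean."""
    by_domain = defaultdict(list)
    for entity in entities:
        by_domain[entity["domain"]].append(entity)

    outliers = []
    for group in by_domain.values():
        scores = [e["scope"] for e in group]
        mu, sigma = mean(scores), pstdev(scores)
        if sigma == 0:
            continue  # all scores identical: nothing can be an outlier
        outliers += [e["slug"] for e in group if abs(e["scope"] - mu) > k * sigma]
    return outliers


outliers = scope_outliers(entities)
print(outliers)
```

With very small domains the standard deviation is unstable, so flagged entities are best treated as prompts for the follow-up LLM-Eval question rather than as verdicts.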

5. Shared Infrastructure

Several concerns share underlying computations:

| Infrastructure | Used by | Build once |
|---|---|---|
| Definition embeddings (vector per entity) | C1, C3 | Embedding API call per entity |
| Relationship graph (entity → entity edges) | C3, C4 | String matching + LLM-Eval |
| FCA formal context (entity × attribute matrix) | C2, C5 | Metadata parsing + LLM classification |
| Entity metadata index (domain, VSM, chapter, sections) | C2, C5, C10 (schema compliance) | Deterministic markdown parsing |

These should be computed once per evaluation run and cached for use by all concern-specific metrics.

6. Evaluation Workflow

A full collection-level evaluation run:

`process_chapters.py --evaluate-collection --provider <provider>`

Parse — deterministic metadata extraction from all entity files
Embed — compute definition embeddings (cached; only new/changed entities need fresh embeddings)
Infer — LLM-Eval for relationship edges, metaproperties, abstraction levels, pairwise judgments (batched to minimise LLM calls)
Compute — deterministic graph metrics, FCA lattice, coverage matrix, similarity matrix, cycle detection
Aggregate — combine per-entity and per-pair scores into collection-level metrics
Report — write per-concern markdown reports + unified metrics.yaml
Append — add timestamped snapshot to metrics-history.yaml

Incremental mode (--evaluate-collection --chapter <id>) re-evaluates only the entities introduced or modified by that chapter, plus any pairwise checks involving those entities.
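One way to decide which entities need fresh embeddings is to key the cache on a hash of each definition's text. The sketch below is a hypothetical design, not the behaviour of `process_chapters.py`: the cache layout and both function names are assumptions.

```python
import hashlib


def definition_hash(text):
    """Stable fingerprint of a definition's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def entities_needing_embedding(definitions, cache):
    """Slugs whose definition is new or has changed since it was cached.

    `definitions` maps slug -> current definition text.
    `cache` maps slug -> {"hash": str, "vector": list} (assumed layout).
    """
    return sorted(
        slug
        for slug, text in definitions.items()
        if cache.get(slug, {}).get("hash") != definition_hash(text)
    )


definitions = {
    "natural-price": "The price sufficient to pay rent, wages and profit.",
    "market-price": "The actual price at which a commodity is sold.",
    "rent": "The price paid for the use of land.",
}
cache = {
    "natural-price": {
        "hash": definition_hash(definitions["natural-price"]),
        "vector": [0.1, 0.2],
    },
    "market-price": {"hash": definition_hash("An older wording."), "vector": [0.3, 0.4]},
}

stale = entities_needing_embedding(definitions, cache)
print(stale)
```

The same fingerprint could also select which pairwise checks to rerun in incremental mode, since any pair touching a stale entity is affected.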

7. References

Lindland, O.I., Sindre, G. & Sølvberg, A. (1994). "Understanding Quality in Conceptual Modeling." IEEE Software 11(2), 42-49. → SEQUAL framework: validity and completeness dimensions.
Guarino, N. & Welty, C.A. (2004). "An Overview of OntoClean." In Handbook on Ontologies, Springer, 151-171. → Metaproperty analysis: rigidity, identity, unity, dependence.
Poveda-Villalón, M., Gómez-Pérez, A. & Suárez-Figueroa, M.C. (2014). "OOPS! (OntOlogy Pitfall Scanner!): An On-line Tool for Ontology Evaluation." IJSWIS 10(2), 7-34. → Pitfall catalogue: 41 anti-patterns for ontology design.
Tartir, S. & Arpinar, I.B. (2007). "Ontology Evaluation and Ranking using OntoQA." ICSC 2007, IEEE, 185-192. → Schema metrics: relationship richness, attribute richness.
Wille, R. (1982). "Restructuring Lattice Theory." In Ordered Sets, Reidel, 445-470. → Formal Concept Analysis: concept lattices from binary contexts.
Elhaj, H. et al. (2008). "Auditing SNOMED CT with Formal Concept Analysis." AMIA Annual Symposium, PMC2605587. → FCA for ontology completeness auditing.
Keet, C.M. (2008). A Formal Theory of Granularity. PhD thesis, Free University of Bozen-Bolzano. → Granularity levels and perspectives for ontology design.
Mernik, M., Heering, J. & Sloane, A.M. (2005). "When and How to Develop Domain-Specific Languages." ACM Computing Surveys 37(4), 316-344. → DSL design: soundness, completeness, laconicity.
Karsai, G. et al. (2014). "Design Guidelines for Domain Specific Languages." arXiv:1409.2378. → Orthogonality, necessary-and-sufficient principle.
Xue, B. & Zou, L. (2022). "Knowledge Graph Quality Management: A Comprehensive Survey." IEEE TKDE 35(5), 4969-4988. → KG quality dimensions: conciseness, consistency, completeness.
