markitect-main/examples/infospace-with-history/METRICS-METHODOLOGY.md
tegwick 4ce856d4d0 docs: metrics methodology, collection-level tasks, and infospace tooling roadmap
Add METRICS-METHODOLOGY.md documenting the theoretical frameworks
(SEQUAL, OntoClean, OOPS!, OntoQA, FCA, DSL principles) adapted for
two-layer evaluation (LLM-Eval + deterministic aggregation) across
five collection concerns: redundancy, coverage, coherence, consistency,
and granularity balance.

Extend INFRA-TASKS.md with assignment assessment (tasks 4-7),
per-concept metrics (tasks 8-12), and collection-level metrics
(tasks 13-19).

Add roadmap/infospace-tooling/PLAN.md defining terminology (infospace,
topic, discipline, entity, evaluation, viability) and a three-stage
implementation plan: Stage 1 platform additions, Stage 2 infospace
tooling layer, Stage 3 example revision.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-18 23:53:21 +01:00
# Collection-Level Metrics Methodology
How we evaluate the quality of the infospace as a **collection of
interrelated concepts**, beyond the quality of individual entities.
This document describes the theoretical frameworks drawn from ontology
engineering, formal concept analysis, semiotic quality theory, and DSL
design — and how each is adapted to work within MarkiTect's two-layer
evaluation model (LLM-Eval + deterministic aggregation).
---
## 1. The Two-Layer Model
Every metric in this methodology decomposes into two layers:
| Layer | What it does | How it runs |
|-------|-------------|-------------|
| **LLM-Eval** | Qualitative judgment: "Are these two concepts the same?", "Is this definition grounded in the source?" | Prompt template → LLM → structured YAML output |
| **Deterministic** | Quantitative aggregation: cosine similarity, graph connectivity, coverage counting, cycle detection | Python code in `process_chapters.py` or a dedicated `metrics.py` |
The LLM-Eval layer produces **per-entity** or **per-pair** structured
scores. The deterministic layer **aggregates** these into collection-level
metrics, persisted as machine-readable YAML alongside human-readable
markdown reports.
Per-concept quality metrics (definition precision, source grounding, VSM
relevance — see INFRA-TASKS 8-12) operate at the individual entity level.
This document covers the five **collection-level concerns** that assess how
the entities work together as an explanatory system.
---
## 2. Five Collection-Level Concerns
### Overview
| # | Concern | Question | Primary framework |
|---|---------|----------|-------------------|
| C1 | Semantic Overlap | Are there redundant concepts? | OOPS! P2, embedding similarity |
| C2 | Coverage Completeness | Does the concept set cover the domain? | SEQUAL, FCA |
| C3 | Structural Coherence | Do concepts form a connected explanatory graph? | OntoQA, graph theory |
| C4 | Definitional Consistency | Are concepts defined consistently and non-circularly? | OntoClean, OOPS! P24 |
| C5 | Granularity Balance | Are concepts at comparable levels of abstraction? | Granularity theory, DSL laconicity |
---
## 3. Theoretical Frameworks
### 3.1 SEQUAL (Semiotic Quality Framework)
**Origin:** Lindland, Sindre & Sølvberg (1994), extended by Krogstie et al.
**What it defines:** Quality of a conceptual model as the correspondence
between three worlds — the domain (what exists), the model (what we
captured), and the audience's interpretation (what they understand).
Two key dimensions of **semantic quality**:
- **Validity** — everything in the model corresponds to something real
in the domain. No invented concepts.
- **Completeness** — everything relevant in the domain is represented in
the model. No missing concepts.
**How we use it:** SEQUAL frames our entire metrics approach. Every
collection-level metric maps to one of these dimensions:
| SEQUAL dimension | Our concerns |
|-----------------|--------------|
| Validity | C1 (redundancy reduces validity — duplicate concepts don't correspond to distinct domain facts), C4 (consistency — contradictory definitions can't both be valid) |
| Completeness | C2 (coverage — are all needed concepts present?), C5 (granularity — missing levels of abstraction are completeness gaps) |
| Both | C3 (coherence — disconnected concepts suggest either missing bridging concepts [completeness] or misplaced concepts [validity]) |
**Adaptation:** SEQUAL was designed for formal models evaluated by human
experts. We replace human judgment with LLM-Eval (for validity checks like
"does this concept correspond to something Smith actually described?") and
deterministic counting (for completeness checks like "which VSM systems
lack entity mappings?").
### 3.2 OntoClean
**Origin:** Guarino & Welty (2004).
**What it defines:** A methodology for validating taxonomic relationships
by assigning **metaproperties** to each concept:
- **Rigidity** — Is the property essential to all its instances? (e.g.
"market" is rigid; "effectual demander" is anti-rigid — an agent can
stop being an effectual demander)
- **Identity** — Does the concept carry an identity criterion? (e.g.
"division of labour" can be identified by its three causal mechanisms)
- **Unity** — Are all instances of this concept whole in the same way?
- **Dependence** — Does the concept require another concept to exist?
(e.g. "market price" depends on "effectual demand")
**Constraint:** A rigid concept cannot be subsumed by an anti-rigid one.
Violations indicate structural confusion.
**How we use it:** We do not have a formal taxonomy, but our flat entity
set implicitly contains subsumption relationships (e.g. "natural rate"
subsumes "ordinary-or-average rate"). OntoClean metaproperties help detect:
- **Granularity mismatches** (C5): A rigid concept at the same level as
an anti-rigid one suggests different abstraction levels are mixed.
- **Definitional consistency** (C4): If entity A depends on entity B per
OntoClean, but B's definition doesn't acknowledge A, the definitions
are inconsistent.
- **Redundancy** (C1): Two entities with identical metaproperty profiles
and overlapping definitions are candidates for merging.
**Adaptation:** Instead of manual metaproperty assignment, we use LLM-Eval
to classify each entity's rigidity, identity criterion, and dependencies.
The constraint checking is then deterministic.
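
The deterministic half of this check can be sketched as follows. This is a minimal illustration, not the project's implementation: the rigidity labels stand in for LLM-Eval output, and the subsumption pairs (including the deliberately wrong second one) are invented sample data.

```python
from typing import Dict, List, Tuple

# Hypothetical LLM-Eval output: one rigidity label per entity slug.
RIGIDITY: Dict[str, str] = {
    "market": "rigid",
    "effectual-demander": "anti-rigid",
    "economic-agent": "rigid",
}

# Hypothetical subsumption pairs (parent, child): parent subsumes child.
SUBSUMPTIONS: List[Tuple[str, str]] = [
    ("economic-agent", "effectual-demander"),  # rigid over anti-rigid: allowed
    ("effectual-demander", "market"),          # anti-rigid over rigid: violation
]


def rigidity_violations(rigidity, subsumptions):
    """Return (parent, child) pairs where an anti-rigid parent subsumes a rigid child."""
    return [
        (parent, child)
        for parent, child in subsumptions
        if rigidity.get(parent) == "anti-rigid" and rigidity.get(child) == "rigid"
    ]


violations = rigidity_violations(RIGIDITY, SUBSUMPTIONS)
print(violations)
```

Entities without a label are skipped rather than flagged, so a missing classification never produces a false violation.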
### 3.3 OOPS! (Ontology Pitfall Scanner)
**Origin:** Poveda-Villalón et al. (2014). Catalogue of 41 common
ontology design pitfalls.
**What it defines:** Concrete, testable anti-patterns. The pitfalls most
relevant to our infospace:
| Pitfall | Description | Our concern |
|---------|-------------|-------------|
| P2 | Synonymous classes — different names, same meaning | C1 (redundancy) |
| P4 | Unconnected ontology elements | C3 (coherence) |
| P7 | Merging different concepts in the same class | C5 (granularity: too coarse) |
| P10 | Missing disjointness | C1 (how do we know two concepts don't overlap?) |
| P11 | Missing domain or range | C4 (consistency) |
| P13 | Missing inverse relationships | C3 |
| P24 | Recursive/circular definition | C4 (consistency) |
| P25 | Inverse of itself | C4 |
**How we use it:** OOPS! pitfalls become a **checklist for LLM-Eval
prompts**. Rather than running a formal OWL scanner, we ask the LLM to
check for each pitfall pattern:
- "Are entities A and B synonymous?" (P2)
- "Does entity A's definition reference itself?" (P24)
- "Is entity A actually two distinct concepts merged together?" (P7)
The deterministic layer counts pitfall occurrences and tracks them over
time.
**Adaptation:** We select the subset of OOPS! pitfalls applicable to
semi-formal markdown-based ontologies (no OWL axioms) and implement each
as an LLM-Eval prompt pattern rather than a formal reasoner check.
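
A minimal sketch of the counting and tracking step, assuming each LLM-Eval finding is a record with a `pitfall` code and the affected `entities`. The record shape, the sample findings, and the previous-run counts are all hypothetical.

```python
from collections import Counter

# Hypothetical LLM-Eval findings; the record shape is an assumption.
FINDINGS = [
    {"pitfall": "P2", "entities": ["natural-price", "ordinary-price"]},
    {"pitfall": "P24", "entities": ["market-price"]},
    {"pitfall": "P2", "entities": ["stock", "capital"]},
]
# Hypothetical counts from the previous evaluation run.
PREVIOUS = {"P2": 1, "P7": 1}


def pitfall_counts(findings):
    """Count how many findings were reported for each pitfall code."""
    return Counter(f["pitfall"] for f in findings)


def pitfall_deltas(current, previous):
    """Change in count per pitfall since the previous run (negative = improved)."""
    codes = set(current) | set(previous)
    return {code: current.get(code, 0) - previous.get(code, 0) for code in sorted(codes)}


counts = pitfall_counts(FINDINGS)
deltas = pitfall_deltas(counts, PREVIOUS)
print(dict(counts))
print(deltas)
```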
### 3.4 OntoQA (Metric-Based Ontology Quality Analysis)
**Origin:** Tartir & Arpinar (2007).
**What it defines:** Quantitative schema-level and instance-level metrics:
- **Relationship Richness (RR):** Proportion of non-taxonomic (lateral)
relationships to total relationships. `RR = non_hierarchical / total`.
Low RR = mere taxonomy. High RR = rich cross-cutting connections.
- **Attribute Richness (AR):** Average number of attributes per concept.
`AR = total_attributes / total_concepts`.
- **Inheritance Richness (IR):** Average subclasses per class — measures
how knowledge distributes across the hierarchy.
- **Class Richness (CR):** Proportion of classes with instances.
**How we use it:** Our entities don't have formal relationships declared
between them, but we can **infer** a relationship graph from their
definitions and mappings:
- Entity A references entity B in its definition → definitional dependency
- Entities A and B map to the same VSM system → structural co-occurrence
- Entities A and B appear in the same chapter → contextual co-occurrence
From this inferred graph, we compute OntoQA metrics directly:
- **Relationship Richness** tells us whether our concepts form a web of
explanatory connections or just a flat list.
- **Attribute Richness** maps to our schema sections — entities with more
optional sections filled (Original Wording, Modern Interpretation) are
richer.
**Adaptation:** The key modification is that relationship inference is an
LLM-Eval step (pairwise: "does A's definition depend on or reference B?"),
after which all OntoQA metrics are computed deterministically on the
resulting graph.
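
Once the edges exist, the two metrics reduce to simple ratios. The sketch below assumes each inferred edge carries a `kind` label and that "attributes" are counted as filled schema sections; the edge kinds, sample edges, and section counts are illustrative, not taken from the infospace.

```python
# Hypothetical inferred edges: (source, target, kind).
EDGES = [
    ("natural-rate", "ordinary-or-average-rate", "subsumes"),
    ("market-price", "effectual-demand", "depends_on"),
    ("division-of-labour", "market-extent", "depends_on"),
    ("market-price", "natural-price", "co_occurs"),
]
# Hypothetical number of filled schema sections per entity.
SECTIONS_FILLED = {"market-price": 4, "natural-rate": 2, "effectual-demand": 3}


def relationship_richness(edges, hierarchical_kinds=("subsumes",)):
    """OntoQA RR: share of edges that are not hierarchical."""
    if not edges:
        return 0.0
    lateral = sum(1 for _, _, kind in edges if kind not in hierarchical_kinds)
    return lateral / len(edges)


def attribute_richness(sections_filled):
    """OntoQA AR: average number of attributes (here: filled sections) per entity."""
    if not sections_filled:
        return 0.0
    return sum(sections_filled.values()) / len(sections_filled)


rr = relationship_richness(EDGES)
ar = attribute_richness(SECTIONS_FILLED)
print(rr, ar)
```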
### 3.5 Formal Concept Analysis (FCA)
**Origin:** Wille (1982). Applied to ontology auditing by Elhaj et al.
(2008) for SNOMED CT completeness checking.
**What it defines:** A mathematical framework for deriving a **concept
lattice** from a binary relation between objects and attributes. The
lattice reveals:
- **Formal concepts**: pairs of a maximal object set and the attribute set that exactly those objects share
- **Subconcept/superconcept** relationships: the natural hierarchy
- **Missing concepts**: attribute combinations with no corresponding object
**How we use it:** We construct a **formal context** (binary matrix):
- **Objects** = our 85 entities
- **Attributes** = economic domain, VSM system, source book, abstraction
level (from LLM-Eval), key terms (extracted from definitions)
The concept lattice then reveals:
- **Coverage gaps** (C2): Attribute combinations with no entity. E.g. if
the cell {Distribution, S3} is empty, we lack control-layer concepts
for distribution — a specific, actionable gap.
- **Redundancy** (C1): Entities with identical attribute sets (same formal
concept) are candidates for merging.
- **Granularity** (C5): The lattice depth indicates how many meaningful
levels of abstraction exist. A shallow lattice suggests missing
intermediate concepts.
**Adaptation:** Classic FCA requires crisp binary attributes. Our domains
and VSM mappings are already categorical, but abstraction level and key
terms need LLM-Eval to produce. The lattice computation itself is
deterministic (Python `concepts` library or equivalent). The FCA approach
replaces the current "ask the LLM about coverage" with a structural
computation that can identify *specific* gaps rather than vague
recommendations.
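
To make the lattice computation concrete, here is a dependency-free sketch that derives the concept intents of a small context by closing the object rows under intersection. It stands in for a dedicated FCA library; the four entities and their attributes are hypothetical sample data.

```python
from collections import defaultdict

# Hypothetical formal context: entity slug -> set of attributes.
CONTEXT = {
    "market-price": {"Exchange", "S2"},
    "natural-price": {"Exchange", "S2"},
    "wages-of-labour": {"Distribution", "S1"},
    "division-of-labour": {"Production", "S1"},
}


def concept_intents(context):
    """All concept intents: the full attribute set plus every intersection of object rows."""
    intents = {frozenset().union(*context.values())}
    for attrs in context.values():
        intents |= {frozenset(attrs) & intent for intent in intents}
    return intents


def extent(context, intent):
    """Objects that carry every attribute in the intent."""
    return {obj for obj, attrs in context.items() if intent <= attrs}


def identical_rows(context):
    """Groups of entities with exactly the same attribute set (merge candidates)."""
    groups = defaultdict(list)
    for obj, attrs in context.items():
        groups[frozenset(attrs)].append(obj)
    return [sorted(group) for group in groups.values() if len(group) > 1]


intents = concept_intents(CONTEXT)
duplicates = identical_rows(CONTEXT)
s1_extent = extent(CONTEXT, frozenset({"S1"}))
print(len(intents), duplicates, sorted(s1_extent))
```

Ordering the intents by set inclusion gives the lattice; the intersection closure is adequate for a context of this size, while a larger context would likely call for a dedicated FCA implementation.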
### 3.6 DSL Design Principles
**Origin:** Mernik et al. (2005) "When and How to Develop DSLs";
Karsai et al. (2014) "Design Guidelines for Domain-Specific Languages".
**What they define:** Quality criteria for a set of concepts that form a
language for a specific domain:
- **Soundness**: Every concept in the language corresponds to a real domain
concern (no invented abstractions).
- **Completeness**: The language can express everything needed for its
intended tasks.
- **Laconicity**: No unnecessary concepts — every concept earns its place.
- **Orthogonality**: Concepts are independent; combining any two produces
a meaningful result (no redundant combinations).
**How we use it:** Our entity set is effectively a domain-specific
vocabulary for "explaining classical economics through VSM". DSL quality
criteria translate directly:
- **Soundness** → Validity (SEQUAL): every entity grounded in Smith's text
- **Completeness** → Coverage (C2): can we answer the "competency
questions" the infospace is meant to address?
- **Laconicity** → Anti-redundancy (C1) + Indispensability (C5): would
removing any entity lose explanatory power?
- **Orthogonality** → Non-overlap (C1): entity definitions don't
substantially duplicate each other
**Adaptation:** We operationalise DSL completeness through **competency
questions** — a set of canonical questions the infospace should be able to
answer (e.g. "How does the division of labour relate to market extent?",
"What mechanisms regulate wages toward their natural rate?"). LLM-Eval
tests whether the current entity set suffices to answer each question.
Unanswerable questions identify specific completeness gaps.
Laconicity is operationalised as **indispensability scoring**: for each
entity, LLM-Eval rates whether removing it would lose explanatory power.
Low-scoring entities are candidates for merging or retirement.
---
## 4. Integration: Metric Definitions by Concern
### C1: Semantic Overlap / Redundancy
**Goal:** Identify entities that substantially overlap in meaning and
should be merged, distinguished, or retired.
**Metrics:**
| Metric | Type | Computation |
|--------|------|-------------|
| `similarity_matrix` | Deterministic | Embed all entity definitions; compute NxN cosine similarity |
| `high_similarity_pairs` | Deterministic | Pairs with cosine > 0.80, sorted descending |
| `confirmed_synonyms` | LLM-Eval | For each high-similarity pair, LLM judges: "same concept" / "genuinely distinct" / "partial overlap" |
| `redundancy_ratio` | Deterministic | `confirmed_synonyms / total_entities` |
| `intensional_conciseness` | Deterministic | `1 - redundancy_ratio` (from KG quality framework) |
**Pipeline:**
1. Embed definitions (embedding API or local model)
2. Compute cosine similarity matrix
3. Filter pairs above threshold
4. LLM pairwise judgment on filtered pairs only (avoids N² LLM calls)
5. Aggregate into ratio and conciseness score
**Output:** `output/metrics/redundancy-report.md` + structured YAML with
pair list, scores, and merge/retire recommendations.
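
Steps 2, 3, and 5 of the pipeline can be sketched with the standard library alone. The three-dimensional vectors are toy stand-ins for real embeddings, the `VERDICTS` dictionary stands in for the LLM judgment of step 4, and counting confirmed pairs in the numerator of `redundancy_ratio` is an assumption about how that metric is meant to be read.

```python
import math
from itertools import combinations

# Toy embeddings; a real run would take these from an embedding model.
EMBEDDINGS = {
    "natural-price": [0.9, 0.1, 0.0],
    "ordinary-price": [0.88, 0.12, 0.02],  # hypothetical near-duplicate
    "division-of-labour": [0.0, 0.2, 0.95],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def high_similarity_pairs(embeddings, threshold=0.80):
    """Pairs with cosine above the threshold, most similar first."""
    scored = [
        (a, b, cosine(embeddings[a], embeddings[b]))
        for a, b in combinations(sorted(embeddings), 2)
    ]
    return sorted((p for p in scored if p[2] > threshold), key=lambda p: p[2], reverse=True)


candidates = high_similarity_pairs(EMBEDDINGS)
# Stand-in for step 4: the LLM verdict per candidate pair (hypothetical).
VERDICTS = {("natural-price", "ordinary-price"): "same concept"}
confirmed = [(a, b) for a, b, _ in candidates if VERDICTS.get((a, b)) == "same concept"]
redundancy_ratio = len(confirmed) / len(EMBEDDINGS)  # assumption: pairs / entities
intensional_conciseness = 1 - redundancy_ratio
print(candidates, redundancy_ratio, intensional_conciseness)
```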
### C2: Coverage Completeness
**Goal:** Identify domain areas and VSM systems that lack adequate
representation in the entity set.
**Metrics:**
| Metric | Type | Computation |
|--------|------|-------------|
| `domain_vsm_matrix` | Deterministic | Count entities per {economic_domain, VSM_system} cell |
| `coverage_ratio` | Deterministic | `populated_cells / expected_cells` |
| `vsm_balance_entropy` | Deterministic | Shannon entropy of entity distribution across VSM systems (higher = more balanced) |
| `empty_cells` | Deterministic | List of {domain, VSM_system} pairs with zero entities |
| `competency_coverage` | LLM-Eval | For each competency question, can it be answered with current entities? |
| `fca_gap_concepts` | Deterministic | Attribute combinations in the FCA lattice with no corresponding entity |
**Pipeline:**
1. Parse entity metadata (domain, VSM mapping) from files on disk
2. Build domain × VSM matrix; identify empty cells
3. Build FCA formal context; compute lattice; extract gap concepts
4. Define competency questions (initially hand-written, later LLM-generated
from the source material)
5. LLM-evaluate answerability of each question
6. Aggregate into coverage ratio, entropy, and gap list
**Output:** `output/metrics/coverage-report.md` + YAML with matrix, gaps,
and competency question results.
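
The deterministic parts of this pipeline (matrix, empty cells, coverage ratio, entropy) fit in a few lines. The domain names, the S1 to S5 system labels, and the four sample entities below are illustrative assumptions rather than the infospace's actual metadata.

```python
import math
from collections import Counter
from itertools import product

# Hypothetical parsed metadata: (slug, economic domain, VSM system).
ENTITIES = [
    ("division-of-labour", "Production", "S1"),
    ("market-price", "Exchange", "S2"),
    ("natural-price", "Exchange", "S2"),
    ("wages-of-labour", "Distribution", "S1"),
]
DOMAINS = ["Production", "Exchange", "Distribution"]
SYSTEMS = ["S1", "S2", "S3", "S4", "S5"]


def shannon_entropy(counts):
    """Shannon entropy in bits of a list of non-negative counts."""
    total = sum(counts)
    if not total:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c)


domain_vsm_matrix = Counter((domain, system) for _, domain, system in ENTITIES)
expected_cells = list(product(DOMAINS, SYSTEMS))
empty_cells = [cell for cell in expected_cells if domain_vsm_matrix[cell] == 0]
coverage_ratio = (len(expected_cells) - len(empty_cells)) / len(expected_cells)

per_system = Counter(system for _, _, system in ENTITIES)
vsm_balance_entropy = shannon_entropy([per_system[s] for s in SYSTEMS])
print(coverage_ratio, vsm_balance_entropy, empty_cells[:3])
```

With five systems the entropy tops out at `log2(5)` bits, so dividing by that value would give a balance score between 0 and 1 if a normalised figure is preferred.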
### C3: Structural Coherence
**Goal:** Determine whether the entities form a connected explanatory web
or a fragmented collection of isolated concepts.
**Metrics:**
| Metric | Type | Computation |
|--------|------|-------------|
| `relationship_graph` | LLM-Eval + Deterministic | Infer edges from definition cross-references (string matching) + LLM judgment for implicit references |
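| `connected_components` | Deterministic | Number of weakly connected components in the directed graph, i.e. ignoring edge direction (target: 1) |
| `graph_density` | Deterministic | `actual_edges / possible_edges`, where `possible_edges = N × (N - 1)` for a directed graph on N entities |
| `avg_degree` | Deterministic | `total_edges / total_entities` (average out-degree of the directed graph) |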
| `relationship_richness` | Deterministic | OntoQA RR: `non_hierarchical_edges / total_edges` |
| `modularity` | Deterministic | Louvain modularity score (0.3-0.7 = meaningful structure; >0.8 = fragmentation) |
| `bridge_concepts` | Deterministic | Entities with highest betweenness centrality (connect clusters) |
| `orphan_entities` | Deterministic | Entities with degree 0 or 1 |
| `cohesion_by_domain` | Deterministic | Avg intra-domain edges per entity |
| `coupling_across_domains` | Deterministic | Inter-domain edges / total edges |
**Pipeline:**
1. Extract explicit cross-references from definitions (entity name
mentions in other definitions — string matching with slug normalisation)
2. For entity pairs not caught by string matching, LLM-Eval: "Does A's
definition depend on or reference B's concept?"
3. Build directed graph
4. Compute graph metrics (networkx or equivalent)
5. Run community detection; compare detected communities to declared
economic domains
**Output:** `output/metrics/coherence-report.md` + YAML with graph
statistics, orphan list, bridge concepts, and community structure.
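
A few of the simpler graph metrics can be sketched without external dependencies, which shows what a graph library would compute in step 4. Components are counted with edge direction ignored, and density uses the directed-graph denominator `N * (N - 1)`; both readings, and the sample entities and edges, are assumptions.

```python
from collections import defaultdict

# Hypothetical entity slugs and inferred directed edges (A -> B: A references B).
ENTITIES = ["market-price", "effectual-demand", "natural-price",
            "division-of-labour", "market-extent", "stock"]
EDGES = [
    ("market-price", "effectual-demand"),
    ("market-price", "natural-price"),
    ("division-of-labour", "market-extent"),
]


def weak_components(nodes, edges):
    """Connected components when edge direction is ignored."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)
    return components


def orphan_entities(nodes, edges):
    """Entities whose total degree (in + out) is 0 or 1."""
    degree = {node: 0 for node in nodes}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return sorted(node for node, d in degree.items() if d <= 1)


components = weak_components(ENTITIES, EDGES)
graph_density = len(EDGES) / (len(ENTITIES) * (len(ENTITIES) - 1))
orphans = orphan_entities(ENTITIES, EDGES)
print(len(components), graph_density, orphans)
```

Modularity and betweenness centrality are left out of this sketch; those are better taken from a graph library than reimplemented by hand.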
### C4: Definitional Consistency
**Goal:** Ensure entities are defined consistently, non-circularly, and
without contradicting each other.
**Metrics:**
| Metric | Type | Computation |
|--------|------|-------------|
| `definitional_dependency_graph` | Deterministic + LLM-Eval | Edges where A's definition uses B's concept |
| `circular_definitions` | Deterministic | Cycles of length ≤ 3 in the dependency graph |
| `definition_depth` | Deterministic | Longest dependency chain per entity before reaching a term not in the entity set |
| `undefined_dependencies` | Deterministic | Terms used in definitions that arguably should be entities but aren't |
| `pairwise_consistency` | LLM-Eval | For related entity pairs (sharing edges): "Do these definitions contradict each other?" |
| `source_fidelity` | LLM-Eval | "Does this definition accurately represent what Smith wrote in the cited passage?" |
| `metaproperty_violations` | LLM-Eval + Deterministic | OntoClean constraint checking after LLM classifies rigidity/identity |
| `grounding_ratio` | Deterministic | Fraction of entities traceable to primitives without cycles |
**Pipeline:**
1. Build definitional dependency graph (same technique as C3; edge direction
matters here: an edge A → B means A's definition uses B, not vice versa)
2. Detect cycles; flag short cycles
3. Extract undefined terms (terms matching entity-name patterns that appear
in definitions but have no corresponding entity file)
4. LLM pairwise consistency check on directly-connected pairs
5. LLM source fidelity check (compare definition to source chapter text)
6. LLM OntoClean metaproperty classification; deterministic constraint
checking
**Output:** `output/metrics/consistency-report.md` + YAML with cycle list,
undefined terms, contradiction candidates, and metaproperty violations.
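
Steps 1 and 2 and the `grounding_ratio` metric can be sketched on an adjacency mapping. The dependency data is invented, including the three-entity cycle, and treating an entity as "grounded" when no dependency chain from it runs into a cycle is one possible reading of the metric, not a confirmed definition.

```python
# Hypothetical dependency graph: entity -> entities its definition uses.
DEPENDS_ON = {
    "market-price": ["effectual-demand"],
    "effectual-demand": ["natural-price"],
    "natural-price": ["market-price"],  # closes a 3-cycle (invented for the example)
    "wages-of-labour": ["stock"],
    "stock": [],
}


def short_cycles(graph, max_len=3):
    """Simple directed cycles with at most max_len nodes, each reported once."""
    found = set()

    def walk(path):
        for nxt in graph.get(path[-1], []):
            if nxt == path[0]:
                # Rotate so the cycle starts at its smallest slug (canonical form).
                i = path.index(min(path))
                found.add(tuple(path[i:] + path[:i]))
            elif nxt not in path and len(path) < max_len:
                walk(path + [nxt])

    for node in graph:
        walk([node])
    return sorted(found)


def is_grounded(graph, node, trail=()):
    """True if no dependency chain starting at node runs into a cycle."""
    if node in trail:
        return False
    return all(is_grounded(graph, dep, trail + (node,)) for dep in graph.get(node, []))


circular_definitions = short_cycles(DEPENDS_ON)
grounded = [node for node in DEPENDS_ON if is_grounded(DEPENDS_ON, node)]
grounding_ratio = len(grounded) / len(DEPENDS_ON)
print(circular_definitions, grounding_ratio)
```

The recursive `is_grounded` revisits shared dependencies and is only suitable for small graphs; a production version would memoise results or use a topological sort.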
### C5: Granularity Balance
**Goal:** Ensure entities operate at comparable levels of abstraction
within their respective domains and perspectives.
**Metrics:**
| Metric | Type | Computation |
|--------|------|-------------|
| `abstraction_classification` | LLM-Eval | Classify each entity as theory-level / mechanism-level / observation-level |
| `scope_score` | LLM-Eval | Rate each entity 1-5 for generality (1 = very specific instance, 5 = broad theoretical principle) |
| `abstraction_distribution` | Deterministic | Count per level; compute entropy |
| `scope_variance` | Deterministic | Variance of scope scores within each domain |
| `level_x_perspective_matrix` | Deterministic | Cross-tabulation of abstraction level × economic domain |
| `indispensability` | LLM-Eval | "If removed, what explanatory power is lost?" (1-5) |
| `dispensable_entities` | Deterministic | Entities with indispensability score ≤ 2 |
| `merge_candidates` | LLM-Eval | Pairs where one is a sub-case of the other |
**Pipeline:**
1. LLM-classify each entity: abstraction level, scope score,
indispensability
2. Build level × perspective matrix
3. Compute distribution entropy and per-domain scope variance
4. Flag outliers: entities whose scope score deviates > 1.5σ from their
domain mean
5. For outlier entities, LLM-Eval: "Should this be merged into a broader
concept, or split into sub-concepts?"
**Output:** `output/metrics/granularity-report.md` + YAML with
classifications, distribution, outliers, and merge/split recommendations.
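
Steps 3 and 4 and the `dispensable_entities` filter can be sketched with the `statistics` module. The scores below stand in for LLM-Eval output, the entity and domain names are hypothetical, and the use of the population standard deviation (rather than the sample one) for σ is an assumption.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical LLM-Eval scores: slug -> (domain, scope 1-5, indispensability 1-5).
SCORES = {
    "market-price": ("Exchange", 3, 5),
    "natural-price": ("Exchange", 3, 4),
    "effectual-demand": ("Exchange", 3, 4),
    "barter": ("Exchange", 3, 2),
    "theory-of-value": ("Exchange", 5, 3),
    "division-of-labour": ("Production", 4, 5),
    "pin-making": ("Production", 2, 2),
}


def scope_outliers(scores, k=1.5):
    """Entities whose scope score deviates more than k sigma from their domain mean."""
    by_domain = defaultdict(list)
    for domain, scope, _ in scores.values():
        by_domain[domain].append(scope)
    outliers = []
    for slug, (domain, scope, _) in scores.items():
        sigma = pstdev(by_domain[domain])  # population sigma: an assumption
        if sigma and abs(scope - mean(by_domain[domain])) > k * sigma:
            outliers.append(slug)
    return sorted(outliers)


outliers = scope_outliers(SCORES)
dispensable_entities = sorted(slug for slug, (_, _, ind) in SCORES.items() if ind <= 2)
scope_variance = {
    domain: pstdev([s for d, s, _ in SCORES.values() if d == domain]) ** 2
    for domain in {d for d, _, _ in SCORES.values()}
}
print(outliers, dispensable_entities, scope_variance)
```

With only a handful of entities per domain the 1.5σ rule is coarse: in a domain of two entities neither can ever exceed it, so small domains may need a different outlier test.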
---
## 5. Shared Infrastructure
Several concerns share underlying computations:
| Infrastructure | Used by | How it is built (once per run) |
|---------------|---------|------------|
| Definition embeddings (vector per entity) | C1, C3 | Embedding API call per entity |
| Relationship graph (entity → entity edges) | C3, C4 | String matching + LLM-Eval |
| FCA formal context (entity × attribute matrix) | C2, C5 | Metadata parsing + LLM classification |
| Entity metadata index (domain, VSM, chapter, sections) | C2, C5, C10 (schema compliance) | Deterministic markdown parsing |
These should be computed once per evaluation run and cached for use by
all concern-specific metrics.
---
## 6. Evaluation Workflow
A full collection-level evaluation run:
```
process_chapters.py --evaluate-collection --provider <provider>
```
1. **Parse** — deterministic metadata extraction from all entity files
2. **Embed** — compute definition embeddings (cached; only new/changed
entities need fresh embeddings)
3. **Infer** — LLM-Eval for relationship edges, metaproperties,
abstraction levels, pairwise judgments (batched to minimise LLM calls)
4. **Compute** — deterministic graph metrics, FCA lattice, coverage
matrix, similarity matrix, cycle detection
5. **Aggregate** — combine per-entity and per-pair scores into
collection-level metrics
6. **Report** — write per-concern markdown reports + unified `metrics.yaml`
7. **Append** — add timestamped snapshot to `metrics-history.yaml`
Incremental mode (`--evaluate-collection --chapter <id>`) re-evaluates
only the entities introduced or modified by that chapter, plus any
pairwise checks involving those entities.
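
One way to decide which entities need fresh embeddings and which pairwise checks to repeat is to key the cache on a hash of each definition. This is a sketch under that assumption; the cache layout, the helper names, and the placeholder definitions are hypothetical and not part of `process_chapters.py`.

```python
import hashlib
from itertools import combinations


def definition_hash(text):
    """Stable content hash used as the cache key for one definition."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def stale_entities(definitions, cached_hashes):
    """Entities that are new or whose definition text changed since the cached run."""
    return sorted(
        slug for slug, text in definitions.items()
        if cached_hashes.get(slug) != definition_hash(text)
    )


def pairs_to_recheck(slugs, stale):
    """All entity pairs that involve at least one stale entity."""
    stale = set(stale)
    return [(a, b) for a, b in combinations(sorted(slugs), 2) if a in stale or b in stale]


# Placeholder definitions and a cache from a hypothetical earlier run.
DEFINITIONS = {
    "effectual-demand": "placeholder definition A",
    "market-price": "placeholder definition B",
    "natural-price": "placeholder definition C (revised)",
    "stock": "placeholder definition D",
}
CACHE = {
    "effectual-demand": definition_hash("placeholder definition A"),
    "market-price": definition_hash("placeholder definition B"),
    "natural-price": definition_hash("placeholder definition C"),
}

stale = stale_entities(DEFINITIONS, CACHE)
recheck = pairs_to_recheck(DEFINITIONS, stale)
print(stale, len(recheck))
```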
---
## 7. References
- Lindland, O.I., Sindre, G. & Sølvberg, A. (1994). "Understanding
Quality in Conceptual Modeling." *IEEE Software* 11(2), 42-49.
→ SEQUAL framework: validity and completeness dimensions.
- Guarino, N. & Welty, C.A. (2004). "An Overview of OntoClean." In
*Handbook on Ontologies*, Springer, 151-171.
→ Metaproperty analysis: rigidity, identity, unity, dependence.
- Poveda-Villalón, M., Gómez-Pérez, A. & Suárez-Figueroa, M.C. (2014).
"OOPS! (OntOlogy Pitfall Scanner!): An On-line Tool for Ontology
Evaluation." *IJSWIS* 10(2), 7-34.
→ Pitfall catalogue: 41 anti-patterns for ontology design.
- Tartir, S. & Arpinar, I.B. (2007). "Ontology Evaluation and Ranking
using OntoQA." *ICSC 2007*, IEEE, 185-192.
→ Schema metrics: relationship richness, attribute richness.
- Wille, R. (1982). "Restructuring Lattice Theory." In *Ordered Sets*,
Reidel, 445-470.
→ Formal Concept Analysis: concept lattices from binary contexts.
- Elhaj, H. et al. (2008). "Auditing SNOMED CT with Formal Concept
Analysis." *AMIA Annual Symposium*, PMC2605587.
→ FCA for ontology completeness auditing.
- Keet, C.M. (2008). *A Formal Theory of Granularity.* PhD thesis,
Free University of Bozen-Bolzano.
→ Granularity levels and perspectives for ontology design.
- Mernik, M., Heering, J. & Sloane, A.M. (2005). "When and How to
Develop Domain-Specific Languages." *ACM Computing Surveys* 37(4),
316-344.
→ DSL design: soundness, completeness, laconicity.
- Karsai, G. et al. (2014). "Design Guidelines for Domain Specific
Languages." *arXiv:1409.2378*.
→ Orthogonality, necessary-and-sufficient principle.
- Xue, B. & Zou, L. (2022). "Knowledge Graph Quality Management: A
Comprehensive Survey." *IEEE TKDE* 35(5), 4969-4988.
→ KG quality dimensions: conciseness, consistency, completeness.