Add METRICS-METHODOLOGY.md documenting the theoretical frameworks (SEQUAL, OntoClean, OOPS!, OntoQA, FCA, DSL principles) adapted for two-layer evaluation (LLM-Eval + deterministic aggregation) across five collection concerns: redundancy, coverage, coherence, consistency, and granularity balance. Extend INFRA-TASKS.md with assignment assessment (tasks 4-7), per-concept metrics (tasks 8-12), and collection-level metrics (tasks 13-19). Add roadmap/infospace-tooling/PLAN.md defining terminology (infospace, topic, discipline, entity, evaluation, viability) and a three-stage implementation plan: Stage 1 platform additions, Stage 2 infospace tooling layer, Stage 3 example revision. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
550 lines
25 KiB
Markdown
# Markitect Infrastructure Tasks

Issues discovered while building the infospace-with-history example.
The first three (tasks 1-3) have been fixed in commit `706981c` and the
pipeline script refactored to use the fixed infrastructure directly.

## 1. Artifact Repository does not store content — RESOLVED

**File:** `markitect/prompts/resolver/resolver.py`, lines 147-148
**Issue:** `content = f"[Content of {artifact.name} from {space_id}]"` — the
resolver returns placeholder text instead of actual artifact content because
the SQLiteArtifactRepository stores metadata (digest, name, type) but not
the content itself.
**Impact:** Consumers must maintain their own content cache alongside the
repository, defeating the purpose of centralised artifact storage.
**Fix applied:** Added `content` field to `Artifact` model, `content TEXT`
column to SQLite schema (with migration for existing DBs), and replaced
the resolver placeholder with `artifact.content`.

## 2. ContentMacro raw_text defaults to empty string — RESOLVED

**File:** `markitect/prompts/templates/models.py`, line 46
**Issue:** `raw_text: str = ""` — when macros are constructed programmatically
(not parsed from template text), `raw_text` defaults to `""`. The
ContextCompiler then calls `str.replace("", resolved.content)` which inserts
content between every character, producing multi-gigabyte output.
**Impact:** Silent data corruption; compiled prompts become unusable.
**Fix applied:** Added `__post_init__` to `ContentMacro` that auto-derives
`raw_text = f"@{{{self.target}}}"` when not provided.

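The failure mode in task 2 can be reproduced in isolation. The sketch below is illustrative only: `MacroSketch` is a hypothetical stand-in for `ContentMacro`, not the real class, and the template string is invented for the example.

```python
from dataclasses import dataclass


@dataclass
class MacroSketch:
    """Hypothetical stand-in for ContentMacro; field names are assumptions."""
    target: str
    raw_text: str = ""

    def __post_init__(self) -> None:
        # Derive the shorthand form when the caller did not supply raw_text.
        if not self.raw_text:
            self.raw_text = f"@{{{self.target}}}"


# The bug: replacing the empty string inserts the content at every position.
corrupted = "abc".replace("", "X")

# With a derived raw_text, only the macro itself is replaced.
macro = MacroSketch(target="existing_entities")
template = "Known entities: @{existing_entities}"
compiled = template.replace(macro.raw_text, "labour, capital")
```

Because `str.replace("", x)` inserts `x` before every character and at the end, output size grows with template length times content length, which is consistent with the multi-gigabyte output described above.
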
## 3. No TemplateAnalyzer support for @{target} syntax — RESOLVED

**File:** `markitect/prompts/templates/parser.py`
**Issue:** The MacroParser parses `{{kind:target}}` syntax but the
templates in this example use the simplified `@{target}` syntax. There's
no automatic parsing for this format, requiring manual macro construction.
**Fix applied:** Added `SHORTHAND_PATTERN` to `MacroParser` that recognises
`@{target}` and maps it to `MacroKind.REQUIRED`. Updated `has_macros()`,
`count_macros()`, and `find_macro_positions()` accordingly.

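For illustration, a shorthand matcher along these lines might look as follows. The regular expression and both helper functions are assumptions made for this sketch; the actual `SHORTHAND_PATTERN` in `MacroParser` may accept a different character set.

```python
import re

# Assumed pattern: "@{" + a target made of word characters, dots, or hyphens + "}".
SHORTHAND_SKETCH = re.compile(r"@\{([A-Za-z0-9_.\-]+)\}")


def find_shorthand_targets(text: str) -> list[str]:
    """Return the target of every @{target} macro, in order of appearance."""
    return SHORTHAND_SKETCH.findall(text)


def find_shorthand_positions(text: str) -> list[tuple[int, int]]:
    """Return (start, end) offsets for each @{target} occurrence."""
    return [m.span() for m in SHORTHAND_SKETCH.finditer(text)]
```

The pattern requires the leading `@`, so the existing `{{kind:target}}` form is left for the original parser to handle.
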
---

## Assignment Assessment (18 Feb 2026)

How the example measures against the objectives stated in `README.md`:

| # | Objective | Status | Notes |
|---|-----------|--------|-------|
| 1 | Capture knowledge from Wealth of Nations | **Partial** | 7 of 35 chapters processed (Book I, ch. 1-7). 85 canonical entities extracted. |
| 2 | Transform to VSM concepts/entities | **Done (for processed chapters)** | Entities mapped to S1-S5 with strength ratings. |
| 3 | Consistent and complete | **Not yet** | Only 20% of chapters done. Metrics report exists but covers limited scope. |
| 4 | Schemas as scaffolding | **Done** | Four schemas defined and used across all stages. |
| 5 | Prompt dependency resolution | **Done** | `@{macro}` templates resolved via MultiSpaceResolutionStrategy. |
| 6 | Incremental chapter injection | **Done** | Pipeline processes one chapter at a time; `@{existing_entities}` prevents duplication. |
| 7 | Keep changes as git history | **Not done** | See task 4 below. |
| 8 | Metrics for completeness/consistency | **Partial** | Template and report exist but only cover 4 chapters (report predates ch. 5-7). |
| 9 | No infrastructure changes during experiment | **Violated** | Three infra fixes were required (tasks 1-3 above). Documented as intended. |
| 10 | Generate task list for infra issues | **Done** | This file. |

## 4. Infospace has no per-chapter git history — OPEN

**Objective:** README states "The information space should utilize the option
of keeping changes as git history."
**Issue:** The 7 processed chapters were committed in mixed batches alongside
infrastructure changes (LLM adapters, entity refactoring, archive policy).
Chapters 1-2 are bundled into `fecc2fd` with the entire LLM module.
Chapters 5-7 share a single commit (`41773f1`) with the OpenAI adapter and
archive policy. There is no commit where you can `git diff` to see exactly
what one chapter contributed to the infospace.
**Impact:** Cannot use `git log`, `git diff`, or `git bisect` to trace how
the infospace grew chapter by chapter — the core promise of "with history."
**Suggested fix:** Re-run the 7 processed chapters (and remaining 28) using
`process_chapters.py` without `--no-commit`, on a clean branch or after
squashing the current output into a baseline commit. Each chapter gets its
own commit via `_git_commit_chapter()`.

## 5. Prompt files are regenerated as a side-effect of DB rebuild — OPEN

**Issue:** Running `--all --no-commit` to regenerate `infospace.db` also
overwrites `*-prompt.md` files in the output directories because each
pipeline stage unconditionally writes the compiled prompt before checking
whether output already exists. The `@{existing_entities}` macro content
shifts as earlier chapters are loaded, so prompt files for already-processed
chapters change on every full run.
**Impact:** A DB regeneration dirties the working tree with prompt file
changes, even though no actual outputs changed. Users must `git checkout`
the prompt files after regeneration.
**Suggested fix:** Skip writing prompt files when the corresponding output
file already exists on disk, or add a `--rebuild-db-only` flag that
populates the database without touching the file system.

## 6. Metrics report is stale — OPEN

**Issue:** The metrics report (`output/metrics/metrics-report.md`) was
generated after chapters 1-4. Chapters 5-7 have since been processed but
the report has not been refreshed.
**Impact:** The metrics do not reflect the current state of the infospace.
**Suggested fix:** Re-run `--metrics --provider <provider> --no-commit`
after every batch of new chapters. Consider making metrics assessment
automatic at the end of `--book` or `--all` runs.

## 7. Remaining 28 chapters not yet processed — OPEN

**Issue:** Only Book I chapters 1-7 have been processed. The remaining
28 chapters (the rest of Book I and Books II-V) are unprocessed.
**Impact:** The infospace is incomplete — VSM coverage is limited to S1,
S2, and partial S4. S3, S3*, S5, and many systemic concepts (algedonic
signals, recursion, variety) are expected to emerge from later books.
**Suggested fix:** Process remaining chapters in book-sized batches with
per-chapter commits, refreshing metrics after each book.

---

## Per-Concept Metrics (tasks 8-12)

The current metrics system is a single LLM-evaluated narrative report that
assesses the infospace as a whole. It produces no machine-readable output,
cannot be tracked over time, and conflates per-concept quality with
collection-level coherence.

The improvement splits metrics into two layers:

- **LLM-Eval**: A prompt template evaluates each concept individually
  against quality criteria defined in the schema. The LLM returns structured
  scores, not prose.
- **Deterministic aggregation**: `process_chapters.py` computes what it can
  from files on disk (schema compliance, word counts, section presence,
  coverage tallies) and aggregates LLM-eval scores into dashboard metrics.

Both layers persist results in structured form so they can be diffed,
tracked over time, and committed alongside the entities they evaluate.

## 8. Add per-concept quality metrics to entity schema — OPEN

**Issue:** The entity schema (`economic-entity-schema-v1.0.md`) defines
required sections and validation rules (section presence, word count range)
but no quality criteria. There is no definition of what makes a *good*
entity versus a merely *compliant* one.
**Suggested fix:** Add a `## Quality Metrics` section to the entity schema
defining evaluation dimensions with scoring rubrics:

- **Definition Precision** (1-5): Is the definition specific, non-circular,
  and distinguishable from neighbouring concepts?
- **Source Grounding** (1-5): Is the entity grounded in a specific passage?
  Does the citation exist and support the definition?
- **Domain Placement** (1-5): Is the economic domain assignment correct and
  specific (not just "General Theory")?
- **VSM Relevance** (1-5): Does the entity connect meaningfully to at least
  one VSM system, or is it too granular/abstract to map?
- **Explanatory Value** (1-5): Does this entity contribute to explaining
  the economic system, or is it a restatement of another concept?

Similarly update the VSM mapping schema with:

- **Rationale Rigour** (1-5): Is the mapping justified with reference to
  Beer's definitions, not just surface-level analogy?
- **Strength Calibration** (1-5): Is the declared strength (Strong/Moderate/
  Weak) consistent with the rationale given?

These rubrics become the prompt instructions for task 9.

## 9. Create evaluate-entity prompt template — OPEN

**Depends on:** Task 8 (quality metrics in schema).
**Issue:** There is no mechanism to evaluate an existing entity after
extraction. Quality is only judged implicitly during the global metrics
assessment, which is too coarse to identify individual weak entities.
**Suggested fix:** Create `templates/evaluate-entity.md` — a prompt
template that:

1. Takes `@{entity_content}`, `@{source_chapter}`, `@{vsm_framework}`,
   and `@{quality_rubric}` (from the schema's quality metrics section).
2. Asks the LLM to score each dimension (1-5) with a one-sentence
   justification per score.
3. Outputs structured YAML front-matter (scores) followed by markdown
   (justifications), e.g.:

   ```yaml
   ---
   entity: division-of-labour
   scores:
     definition_precision: 5
     source_grounding: 5
     domain_placement: 4
     vsm_relevance: 5
     explanatory_value: 5
   overall: 4.8
   flags: []
   ---
   ```

Add a pipeline stage: `--evaluate` runs this template against every
canonical entity and writes results to `output/evaluations/<slug>-eval.md`.
A `--evaluate --chapter <id>` variant evaluates only entities introduced
by that chapter.

## 10. Add deterministic schema compliance checker — OPEN

**Issue:** Schema compliance is currently LLM-evaluated ("100%" in the
metrics report) but the validation rules in the schemas are mechanical:
section presence, word count ranges, heading format. These should be
checked programmatically, not by an LLM.
**Suggested fix:** Add a `validate_entity(path) -> ValidationResult`
function to `process_chapters.py` (or a new `validate.py` module) that:

- Parses the markdown to extract H2 section headings
- Checks required sections are present (Definition, Source Chapter,
  Context, Economic Domain)
- Counts words in the Definition section (must be 20-150)
- Checks H1 heading exists and is not a slug (e.g. `effectual-demand`
  in chapter 7 has `# effectual-demand` instead of `# Effectual Demand`)
- Validates Source Chapter cites a specific book/chapter
- For mapping files: checks Mapping Strength is one of the enum values

Expose as `--validate` CLI flag. Output a structured report:

```
Validation: 85 entities, 3 warnings
  effectual-demand.md: H1 is slug format, not title case
  porter.md: Definition is 18 words (minimum 20)
  ...
```

This is fully deterministic — no LLM calls needed.

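A minimal sketch of the checker described in task 10, under stated assumptions: it works on markdown text rather than a path, covers only the H1, required-section, and word-count rules, and the names `validate_entity_text`, `split_sections`, and the shape of `ValidationResult` are hypothetical rather than existing code.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

REQUIRED_SECTIONS = ("Definition", "Source Chapter", "Context", "Economic Domain")
SLUG_RE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")


@dataclass
class ValidationResult:
    """Hypothetical result type: a flat list of warning strings."""
    warnings: list[str] = field(default_factory=list)

    @property
    def ok(self) -> bool:
        return not self.warnings


def split_sections(markdown: str) -> tuple[Optional[str], dict[str, str]]:
    """Return the H1 title (if any) and a map of H2 heading -> body text."""
    title, sections, current = None, {}, None
    for line in markdown.splitlines():
        if line.startswith("# ") and title is None:
            title = line[2:].strip()
        elif line.startswith("## "):
            current = line[3:].strip()
            sections[current] = ""
        elif current is not None:
            sections[current] += line + "\n"
    return title, sections


def validate_entity_text(markdown: str) -> ValidationResult:
    result = ValidationResult()
    title, sections = split_sections(markdown)
    if title is None:
        result.warnings.append("missing H1 heading")
    elif SLUG_RE.match(title):
        result.warnings.append("H1 is slug format, not title case")
    for name in REQUIRED_SECTIONS:
        if name not in sections:
            result.warnings.append(f"missing section: {name}")
    if "Definition" in sections:
        words = len(sections["Definition"].split())
        if not 20 <= words <= 150:
            result.warnings.append(f"Definition is {words} words (expected 20-150)")
    return result
```

A path-based `validate_entity(path)` could simply read the file and delegate to this function; the Source Chapter citation and Mapping Strength checks are left out because their exact formats are defined in the schemas, not here.
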
## 11. Structured metrics output format — OPEN

**Depends on:** Tasks 9 and 10.
**Issue:** The metrics report is a markdown narrative. Values cannot be
parsed programmatically, diffed meaningfully, or plotted over time.
**Suggested fix:** Alongside the human-readable `metrics-report.md`,
emit a machine-readable `metrics.yaml` (or `.json`) containing:

```yaml
timestamp: "2026-02-18T12:00:00Z"
chapters_processed: 7
chapters_total: 35
entities_total: 85
entities_archived: 0
vsm_coverage:
  S1: 28
  S2: 12
  S3: 8
  S3_star: 0
  S4: 5
  S5: 0
  recursion: 1
  variety: 0
mapping_strength:
  strong: 64
  moderate: 18
  weak: 3
validation:
  schema_compliant: 82
  warnings: 3
evaluation:            # from LLM-eval (task 9)
  mean_overall: 4.2
  min_overall: 2.8
  flagged_entities: ["porter", "country-workman"]
```

The `--metrics` command writes both files. The YAML file is committed
to git so `git diff` shows exactly how metrics changed between runs.

## 12. Metrics-over-time tracking — OPEN

**Depends on:** Task 11 (structured output).
**Issue:** There is one metrics snapshot that gets overwritten. No history
of how metrics evolved as chapters were added.
**Suggested fix:** Append each metrics snapshot to a cumulative log file
`output/metrics/metrics-history.yaml` (list of timestamped entries). This
is committed to git alongside the current snapshot. The pipeline can
optionally render a simple text-based progress summary:

```
Metrics history (5 snapshots):
  2026-02-10  ch 1/35  13 entities  41.7% VSM coverage
  2026-02-11  ch 4/35  38 entities  50.0% VSM coverage
  2026-02-11  ch 7/35  85 entities  58.3% VSM coverage
  ...
```

This provides the "metrics that improve over time" feedback loop the
README envisions: process chapters → evaluate → see coverage grow (or
flag regressions when a re-extraction reduces quality scores).

---

## Collection-Level Metrics (tasks 13-19)

These tasks implement the five collection-level concerns described in
`METRICS-METHODOLOGY.md`. They share underlying infrastructure (entity
metadata index, definition embeddings, relationship graph) that should
be built once per evaluation run.

See the methodology document for theoretical grounding, framework
references, and the full metric definitions per concern.

## 13. Entity metadata index — deterministic parsing layer — OPEN

**Depends on:** Task 10 (schema compliance checker shares parsing logic).
**Issue:** Several collection-level metrics (coverage matrix, FCA context,
granularity distribution) require structured metadata extracted from entity
files: H1 title, economic domain, VSM system(s), source chapter, section
presence, word counts. Currently this information exists only as prose
inside markdown files.
**Suggested fix:** Add a `parse_entity_metadata(path) -> EntityMeta`
function that extracts from each entity file:

```python
from dataclasses import dataclass


@dataclass
class EntityMeta:
    slug: str
    title: str                       # from H1
    domain: str                      # from Economic Domain section
    source_chapter: str              # from Source Chapter section
    definition_words: int            # word count of Definition section
    has_original_wording: bool       # optional section present?
    has_modern_interpretation: bool  # optional section present?
    vsm_systems: list[str]           # from mapping file if exists
    mapping_strengths: list[str]     # from mapping file if exists
```

Build an index of all entities at the start of each evaluation run.
This index is the input for tasks 14, 15, 16, and 18. Expose as
`--index` CLI flag for inspection.

## 14. Redundancy detection (Concern C1) — OPEN

**Depends on:** Task 13 (metadata index).
**Methodology:** OOPS! P2 (synonymous classes) + embedding similarity +
LLM pairwise judgment. See METRICS-METHODOLOGY.md §4 C1.
**Issue:** Entities with different slugs but overlapping meanings (e.g.
`natural-rate` / `ordinary-or-average-rate`) survive extraction because
dedup only checks slug collisions. There is no semantic overlap detection.
**Suggested fix:** Implement in three stages:

1. **Embed** — Compute vector embeddings of all entity definitions using
   an embedding API (OpenRouter, OpenAI, or a local sentence-transformer).
   Cache embeddings in `output/metrics/embeddings.json` keyed by
   `{slug: content_digest}` so unchanged entities skip re-embedding.

2. **Similarity matrix** — Compute NxN cosine similarity. Write the full
   matrix to `output/metrics/similarity-matrix.json`. Flag all pairs with
   cosine > 0.80 as candidates.

3. **LLM pairwise judgment** — For each candidate pair, run a prompt:
   "Given these two entity definitions, are they (a) the same concept and
   should be merged, (b) genuinely distinct, or (c) partially overlapping
   and should be clarified?" Write results to
   `output/metrics/redundancy-report.md` + YAML.

**Metrics produced:**
- `high_similarity_pairs`: count and list
- `confirmed_synonyms`: count (LLM-confirmed same concept)
- `redundancy_ratio`: `confirmed_synonyms / total_entities`
- `intensional_conciseness`: `1 - redundancy_ratio`

**CLI:** `--check-redundancy --provider <provider>`

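Stage 2 and the ratio metrics from task 14 can be sketched without any embedding provider, assuming embeddings are already available as plain lists of floats keyed by slug. The function names below are hypothetical, and the vectors in any usage are toy values, not real embeddings.

```python
import math
from itertools import combinations


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def candidate_pairs(embeddings: dict[str, list[float]], threshold: float = 0.80):
    """Return (slug_a, slug_b, similarity) for every pair above the threshold."""
    pairs = []
    for a, b in combinations(sorted(embeddings), 2):
        sim = cosine(embeddings[a], embeddings[b])
        if sim > threshold:
            pairs.append((a, b, sim))
    return pairs


def intensional_conciseness(confirmed_synonyms: int, total_entities: int) -> float:
    """1 - redundancy_ratio, as defined in the metrics list above."""
    return 1 - confirmed_synonyms / total_entities
```

Comparing all pairs is quadratic, which is unproblematic at 85 entities (3,570 pairs); only the pairs returned here would be sent to the LLM for pairwise judgment.
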
## 15. Coverage completeness (Concern C2) — OPEN

**Depends on:** Task 13 (metadata index).
**Methodology:** SEQUAL completeness + FCA gap analysis + DSL competency
questions. See METRICS-METHODOLOGY.md §4 C2.
**Issue:** Coverage is currently assessed by the LLM in a single narrative
pass. There is no structured view of which domain × VSM cells are
populated, and no way to test whether the entity set can answer specific
questions about the economic system.
**Suggested fix:** Implement in three stages:

1. **Domain × VSM matrix** — From the metadata index, count entities per
   {economic_domain, vsm_system} cell. Render as a table. Identify empty
   cells as specific, actionable gaps. Compute:
   - `coverage_ratio = populated_cells / total_cells`
   - `vsm_balance_entropy = -Σ(pᵢ log pᵢ)` across VSM systems

2. **FCA lattice** — Construct a formal context with objects = entities,
   attributes = {domain, vsm_system, source_book, abstraction_level}.
   Compute the concept lattice (Python `concepts` library). Extract
   attribute combinations with no corresponding entity — these are
   **structural coverage gaps** not visible in the simple matrix.

3. **Competency questions** — Define a set of 15-20 canonical questions
   the infospace should answer (stored in
   `schemas/competency-questions.md`). Example questions:
   - "How does the division of labour relate to market extent?"
   - "What mechanisms regulate wages toward their natural rate?"
   - "How do monopolies distort the viable system?"

   LLM-Eval tests whether current entities suffice to answer each.
   Unanswerable questions identify specific completeness gaps.

**Metrics produced:**
- `domain_vsm_matrix`: cell counts
- `coverage_ratio`: scalar
- `vsm_balance_entropy`: scalar
- `empty_cells`: list of {domain, vsm_system} gaps
- `fca_gap_concepts`: attribute combos with no entity
- `competency_coverage`: fraction of questions answerable

**CLI:** `--check-coverage --provider <provider>`

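The two stage-1 formulas from task 15 can be sketched as below. The input shape (a list of `(domain, vsm_system)` tuples, one per mapping) and the function name are assumptions for this example, and base-2 logarithms are an assumed choice since the text does not fix the base.

```python
import math
from collections import Counter


def coverage_metrics(mappings, domains, vsm_systems):
    """Compute coverage_ratio, vsm_balance_entropy, and the list of empty cells.

    mappings: list of (domain, vsm_system) tuples, one per entity mapping.
    """
    cells = Counter(mappings)
    all_cells = [(d, s) for d in domains for s in vsm_systems]
    empty = [c for c in all_cells if cells[c] == 0]
    coverage_ratio = (len(all_cells) - len(empty)) / len(all_cells)

    # Entropy over the share of mappings that falls into each VSM system.
    per_system = Counter(s for _, s in mappings)
    total = sum(per_system.values())
    entropy = -sum((n / total) * math.log2(n / total) for n in per_system.values())
    return coverage_ratio, entropy, empty
```

Systems with no mappings contribute nothing to the sum, so entropy is highest when entities are spread evenly across systems and 0 when all sit in one.
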
## 16. Structural coherence (Concern C3) — OPEN

**Depends on:** Task 13 (metadata index).
**Methodology:** OntoQA relationship richness + graph connectivity +
community detection. See METRICS-METHODOLOGY.md §4 C3.
**Issue:** It is unknown whether the 85 entities form a connected
explanatory web or a fragmented collection. No relationship graph exists
between entities.
**Suggested fix:** Implement in three stages:

1. **Explicit cross-references** — Scan each entity's definition for
   mentions of other entity slugs or titles (normalised string matching).
   This is deterministic and catches direct references.

2. **LLM-inferred edges** — For entity pairs not caught by string
   matching but in the same domain or VSM system, LLM-Eval: "Does A's
   definition conceptually depend on or explain B, or vice versa?" Run
   in batches. Write the combined graph to
   `output/metrics/relationship-graph.json` (adjacency list).

3. **Graph analysis** — Using networkx or equivalent:
   - Connected components (target: 1)
   - Graph density, average degree
   - Betweenness centrality → identify bridge concepts
   - Louvain community detection → compare to declared domains
   - OntoQA Relationship Richness
   - Cohesion per domain, coupling across domains
   - Orphan entities (degree 0 or 1)

**Metrics produced:**
- `connected_components`: count (target: 1)
- `graph_density`: scalar
- `avg_degree`: scalar
- `relationship_richness`: OntoQA RR
- `modularity`: Louvain score
- `bridge_concepts`: list (high betweenness centrality)
- `orphan_entities`: list (degree ≤ 1)
- `cohesion_by_domain` / `coupling_across_domains`: scalars

**CLI:** `--check-coherence --provider <provider>`

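A dependency-free sketch of the simpler stage-3 metrics from task 16 (components, density, average degree, orphans) follows. It is the "or equivalent" route rather than networkx, treats the graph as undirected, and uses hypothetical function names; centrality, Louvain communities, and OntoQA richness are not covered.

```python
def build_adjacency(nodes, edges):
    """Undirected adjacency sets; every node appears even if it has no edges."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def connected_components(adj):
    """Return a list of node sets, one per connected component (iterative DFS)."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        components.append(comp)
    return components


def graph_summary(nodes, edges):
    adj = build_adjacency(nodes, edges)
    n = len(adj)
    m = sum(len(v) for v in adj.values()) // 2  # each edge is stored twice
    return {
        "connected_components": len(connected_components(adj)),
        "graph_density": 2 * m / (n * (n - 1)) if n > 1 else 0.0,
        "avg_degree": 2 * m / n if n else 0.0,
        "orphan_entities": sorted(k for k, v in adj.items() if len(v) <= 1),
    }
```

The sketch assumes every edge endpoint is listed in `nodes` and that there are no self-loops; the slugs used to exercise it would come from the metadata index in practice.
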
## 17. Definitional consistency (Concern C4) — OPEN

**Depends on:** Task 16 (relationship graph — the definitional dependency
graph is a directed variant of the same structure).
**Methodology:** OntoClean metaproperties + OOPS! P24 (circular
definitions) + SEQUAL validity. See METRICS-METHODOLOGY.md §4 C4.
**Issue:** No mechanism to detect circular definitions, contradictions
between related entities, or terms used in definitions that should be
entities but aren't.
**Suggested fix:** Implement in four stages:

1. **Definitional dependency graph** — Directed version of the
   relationship graph: edge A→B means A's definition uses B's concept.
   Reuse cross-reference extraction from task 16.

2. **Cycle detection** — Find all cycles of length ≤ 3 in the directed
   graph. Short cycles are problematic (A defines B, B defines A).
   Compute `grounding_ratio`: fraction of entities traceable to terms
   outside the entity set without encountering a cycle.

3. **Undefined dependencies** — Extract terms from definitions that match
   entity-name patterns (capitalised noun phrases, kebab-case slugs) but
   have no corresponding entity file. These are concepts the infospace
   implicitly relies on but hasn't defined.

4. **LLM consistency checks** — For directly-connected entity pairs,
   LLM-Eval: "Do these definitions contradict each other?" For entities
   with Smith's Original Wording, LLM-Eval: "Does the definition
   accurately represent the cited passage?"

**Metrics produced:**
- `circular_definitions`: count and list of cycles (length ≤ 3)
- `grounding_ratio`: fraction of entities reaching primitives
- `undefined_dependencies`: list of missing terms
- `contradiction_candidates`: LLM-flagged pairs
- `source_fidelity_score`: fraction passing source check

**CLI:** `--check-consistency --provider <provider>`

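The short-cycle search in stage 2 of task 17 could look roughly like this. It assumes the dependency graph is a dict mapping each slug to the set of slugs its definition uses; the function name and the canonical form of a reported cycle are choices made for this sketch.

```python
def short_cycles(deps: dict[str, set[str]], max_len: int = 3) -> list[tuple[str, ...]]:
    """Find directed cycles with at most max_len nodes.

    deps[a] is the set of entities that a's definition uses (edge a -> b).
    """
    found = set()

    def walk(path: list[str]) -> None:
        for nxt in deps.get(path[-1], ()):
            if nxt == path[0]:
                # Canonicalise by rotating the smallest slug to the front,
                # so the same cycle found from different starts is stored once.
                i = path.index(min(path))
                found.add(tuple(path[i:] + path[:i]))
            elif nxt not in path and len(path) < max_len:
                walk(path + [nxt])

    for start in deps:
        walk([start])
    return sorted(found)
```

Bounding the path length keeps the search cheap even on dense graphs, at the cost of not reporting longer cycles; `grounding_ratio` would need a separate traversal.
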
## 18. Granularity balance (Concern C5) — OPEN

**Depends on:** Task 13 (metadata index).
**Methodology:** Keet granularity theory + OntoClean rigidity +
DSL laconicity. See METRICS-METHODOLOGY.md §4 C5.
**Issue:** Entities range from broad sectors (`agriculture`) to specific
market roles (`effectual-demanders`) to abstract principles
(`division-of-labour`). It is unclear whether this range is appropriate
or whether some entities are too specific/general relative to their peers.
**Suggested fix:** Implement in three stages:

1. **LLM classification** — For each entity, LLM-Eval assigns:
   - Abstraction level: `theory` / `mechanism` / `observation`
   - Scope score: 1-5 (very specific → very general)
   - Indispensability: 1-5 ("if removed, how much explanatory power lost?")

   Write to `output/evaluations/<slug>-classification.yaml`.

2. **Distribution analysis** — Deterministic:
   - Count per abstraction level; compute entropy
   - Per-domain scope variance (flag domains with high variance)
   - Level × domain matrix (from FCA context in task 15)
   - Outlier detection: entities > 1.5σ from their domain's mean scope

3. **Merge/split recommendations** — For outlier entities, LLM-Eval:
   "Should this entity be merged into a broader concept, split into
   sub-concepts, or is its current granularity justified?" For entities
   with indispensability ≤ 2: "Could another entity serve this purpose?"

**Metrics produced:**
- `abstraction_distribution`: {theory: n, mechanism: n, observation: n}
- `abstraction_entropy`: scalar (higher = more balanced)
- `scope_variance_by_domain`: per-domain scalar
- `dispensable_entities`: list (indispensability ≤ 2)
- `merge_candidates`: list of pairs
- `split_candidates`: list of entities

**CLI:** `--check-granularity --provider <provider>`

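The 1.5σ outlier rule from stage 2 of task 18 might be implemented along these lines. The input shape and function name are hypothetical, and using the population standard deviation (`pstdev`) rather than the sample one is an assumption, since the text does not say which is intended.

```python
from collections import defaultdict
from statistics import mean, pstdev


def scope_outliers(scores: dict[str, tuple[str, int]], k: float = 1.5) -> list[str]:
    """Return slugs whose scope score lies more than k sigma from their domain mean.

    scores maps slug -> (domain, scope score 1-5).
    """
    by_domain = defaultdict(list)
    for slug, (domain, scope) in scores.items():
        by_domain[domain].append((slug, scope))

    outliers = []
    for members in by_domain.values():
        values = [s for _, s in members]
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            continue  # all entities in this domain share one scope score
        outliers += [slug for slug, s in members if abs(s - mu) > k * sigma]
    return sorted(outliers)
```

In domains with only a handful of entities the standard deviation is unstable, so flagged outliers are better treated as prompts for the stage-3 LLM review than as verdicts.
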
## 19. Unified collection evaluation command — OPEN

**Depends on:** Tasks 13-18.
**Issue:** Running five separate `--check-*` commands is cumbersome and
repeats shared computation (metadata parsing, embedding, graph building).
**Suggested fix:** Add `--evaluate-collection --provider <provider>` that
runs all five checks in sequence, sharing infrastructure:

1. Parse entity metadata index (task 13) — used by all
2. Compute embeddings (task 14) — used by C1, C3
3. Build relationship graph (task 16) — used by C3, C4
4. Run all five concern checks
5. Write per-concern reports to `output/metrics/`
6. Write unified `metrics.yaml` with all collection metrics
7. Append to `metrics-history.yaml` (task 12)

Incremental mode: `--evaluate-collection --chapter <id>` re-evaluates
only entities from that chapter plus pairwise checks involving them.

Report a summary to stdout:

```
Collection evaluation (85 entities, 7 chapters):
  Redundancy:   3 synonym candidates, conciseness 0.96
  Coverage:     58% VSM, 20% chapters, 4 domain gaps
  Coherence:    1 component, density 0.12, 2 orphans
  Consistency:  0 cycles, 5 undefined deps, 0 contradictions
  Granularity:  entropy 1.42, 1 dispensable, 2 merge candidates
```