feat(infospace): add eval-summary command and improve evaluate pipeline (S3.3)

- Fix evaluate dimensions to match template file: definition_precision,
  source_grounding, domain_placement, vsm_relevance, explanatory_value
  (was domain_relevance, discipline_alignment, conceptual_clarity)
- Add VSM background context to evaluation prompt so LLM can score
  vsm_relevance without macro injection
- Fix model_name bug: was sending literal "default" to API (HTTP 400)
- Refactor run_entity_evaluation to write files incrementally via callback
  rather than all at once after the batch; long runs are now resumable
  if interrupted
- Add incremental skip in CLI: entities with existing eval files are
  skipped automatically on re-run (acts as resume)
- Add eval-summary command: reads all eval files, shows per-dimension
  means, optionally writes per_entity_mean to metrics.yaml
- Fix record_check_results to merge rather than overwrite metrics.yaml so
  per_entity_mean survives subsequent check runs
- Add per_entity_mean viability threshold (min: 3.5) to infospace.yaml

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-23 01:26:45 +01:00
parent 574bb11db6
commit 7f1eecbdb2
7 changed files with 242 additions and 42 deletions
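
The merge-instead-of-overwrite fix to record_check_results described in the commit message can be illustrated roughly as follows. This is a minimal sketch, not the code from this commit: `merge_metrics` is a hypothetical helper, the metrics are modelled as plain dicts rather than a YAML file, and the alphabetical key order is only inferred from how metrics.yaml appears in the diff. The sample values are taken from the metrics.yaml hunk.

```python
def merge_metrics(existing: dict, check_results: dict) -> dict:
    """Combine previously stored metrics with fresh check results.

    Hypothetical helper: keys present in check_results replace the stored
    values, while keys that the checks do not produce (for example
    per_entity_mean, written by eval-summary) are kept.
    """
    merged = dict(existing)        # copy so the caller's dict is untouched
    merged.update(check_results)   # new check values win on conflict
    # Assumption: metrics.yaml keeps its keys in alphabetical order.
    return dict(sorted(merged.items()))


# Stored metrics, including a score written earlier by eval-summary.
existing = {"coverage_ratio": 0.442424, "per_entity_mean": 4.42}
# Results of a later check run, which knows nothing about per_entity_mean.
check_results = {"coverage_ratio": 0.619048, "granularity_entropy": 2.674752}

merged = merge_metrics(existing, check_results)
print(merged)
```

An overwrite would have written only `check_results` and dropped `per_entity_mean`; the merge keeps it while still refreshing `coverage_ratio`.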
--- a/examples/infospace-with-history/infospace.yaml
+++ b/examples/infospace-with-history/infospace.yaml
@@ -37,6 +37,8 @@ viability:
    max: 0
  granularity_entropy:
    min: 1.0
+  per_entity_mean:
+    min: 3.5  # LLM quality score across 5 dimensions (1-5 scale)

 pipeline:
  stages:
--- a/examples/infospace-with-history/output/metrics/history.yaml
+++ b/examples/infospace-with-history/output/metrics/history.yaml
@@ -934,3 +934,29 @@
    concern: C1
  metadata:
    source: collection-checks
+- snapshot_id: 090bb961
+  created_at: '2026-02-23T00:22:25.818146+00:00'
+  schema_name: default
+  entity_count: 988
+  entity_evaluations: []
+  collection_metrics:
+  - name: coherence_components
+    value: 0.0
+    concern: C3
+  - name: consistency_cycles
+    value: 0.0
+    concern: C4
+  - name: coverage_ratio
+    value: 0.6190476190476191
+    concern: C2
+  - name: granularity_entropy
+    value: 2.6747519428200657
+    concern: C5
+  - name: modularity
+    value: 0.0
+    concern: C3
+  - name: redundancy_ratio
+    value: 0.006072874493927126
+    concern: C1
+  metadata:
+    source: collection-checks
--- a/examples/infospace-with-history/output/metrics/metrics.yaml
+++ b/examples/infospace-with-history/output/metrics/metrics.yaml
@@ -1,6 +1,7 @@
 coherence_components: 0.0
 consistency_cycles: 0.0
-coverage_ratio: 0.442424
-granularity_entropy: 2.953326
+coverage_ratio: 0.619048
+granularity_entropy: 2.674752
 modularity: 0.0
-redundancy_ratio: 0.005877
+per_entity_mean: 4.42
+redundancy_ratio: 0.006073
--- a/examples/infospace-with-history/templates/evaluate-entity.md
+++ b/examples/infospace-with-history/templates/evaluate-entity.md
@@ -0,0 +1,70 @@
+# Evaluate Economic Entity
+
+You are a quality assessor evaluating a single economic entity extracted from
+Adam Smith's *The Wealth of Nations* and mapped to Stafford Beer's Viable
+System Model. Your task is to score the entity on five quality dimensions
+and produce a structured evaluation.
+
+## Entity Under Evaluation
+
+@{entity_content}
+
+## Source Chapter
+
+@{source_chapter}
+
+## VSM Framework Reference
+
+@{vsm_framework}
+
+## Quality Rubric
+
+@{quality_rubric}
+
+## Instructions
+
+1. Read the entity carefully, including its definition, source chapter,
+   context, economic domain, and any VSM mapping information provided.
+2. Locate the relevant passage in the source chapter to verify source grounding.
+3. Consult the VSM framework reference to assess VSM relevance.
+4. Score each dimension 1–5 using the rubric above. Use the full range:
+   reserve 5 for genuinely excellent entries and 1 for clear failures.
+5. For each dimension, write exactly one sentence justifying the score.
+6. Compute the overall score as the mean of the five dimension scores,
+   rounded to two decimal places.
+7. List any flags for issues that warrant attention (empty list if none).
+   Valid flags: `circular-definition`, `missing-citation`, `wrong-domain`,
+   `no-vsm-mapping`, `redundant-with-<slug>`, `overclaimed-strength`,
+   `underclaimed-strength`.
+
+## Output Format
+
+Output YAML front-matter (scores + flags) followed by a markdown section
+with per-dimension justifications. Do not include any other text outside
+this structure.
+
+```
+---
+entity: <slug of the entity, kebab-case>
+scores:
+  definition_precision: <1-5>
+  source_grounding: <1-5>
+  domain_placement: <1-5>
+  vsm_relevance: <1-5>
+  explanatory_value: <1-5>
+overall: <mean rounded to 2 decimal places>
+flags: []
+---
+
+## Justifications
+
+**Definition Precision (<score>/5):** <one sentence>
+
+**Source Grounding (<score>/5):** <one sentence>
+
+**Domain Placement (<score>/5):** <one sentence>
+
+**VSM Relevance (<score>/5):** <one sentence>
+
+**Explanatory Value (<score>/5):** <one sentence>
+```
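
Step 6 of the template (overall score as the mean of the five dimension scores, rounded to two decimal places) and the per-dimension means reported by eval-summary can be sketched as below. This is an illustration under assumptions, not the project's implementation: `overall_score` and `summarize` are hypothetical names, evaluations are modelled as plain dicts instead of parsed eval files, the sample scores are invented, and taking `per_entity_mean` as the mean of the per-entity overall scores is a guess at how the command aggregates. The dimension names and the 3.5 minimum come from the diff.

```python
DIMENSIONS = (
    "definition_precision",
    "source_grounding",
    "domain_placement",
    "vsm_relevance",
    "explanatory_value",
)
MIN_PER_ENTITY_MEAN = 3.5  # viability threshold from infospace.yaml


def overall_score(scores: dict) -> float:
    """Mean of the five dimension scores, rounded to two decimal places."""
    for dim in DIMENSIONS:
        if dim not in scores:
            raise ValueError(f"missing dimension: {dim}")
        if not 1 <= scores[dim] <= 5:
            raise ValueError(f"{dim} must be between 1 and 5")
    return round(sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS), 2)


def summarize(evaluations: list) -> dict:
    """Per-dimension means plus an aggregate over all evaluated entities."""
    count = len(evaluations)
    per_dimension = {
        d: round(sum(e[d] for e in evaluations) / count, 2) for d in DIMENSIONS
    }
    # Assumption: per_entity_mean is the mean of the per-entity overall scores.
    per_entity_mean = round(sum(overall_score(e) for e in evaluations) / count, 2)
    return {
        "per_dimension": per_dimension,
        "per_entity_mean": per_entity_mean,
        "viable": per_entity_mean >= MIN_PER_ENTITY_MEAN,
    }


# Invented sample scores for two entities.
evaluations = [
    dict(zip(DIMENSIONS, (5, 4, 4, 5, 4))),
    dict(zip(DIMENSIONS, (4, 4, 5, 3, 4))),
]
summary = summarize(evaluations)
print(summary)
```

Validating the 1 to 5 range before averaging keeps a malformed evaluation from silently skewing the aggregate.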