feat(infospace): add eval-summary command and improve evaluate pipeline (S3.3)

- Fix evaluate dimensions to match template file: definition_precision,
  source_grounding, domain_placement, vsm_relevance, explanatory_value
  (was domain_relevance, discipline_alignment, conceptual_clarity)
- Add VSM background context to evaluation prompt so LLM can score
  vsm_relevance without macro injection
- Fix model_name bug: was sending literal "default" to API (HTTP 400)
- Refactor run_entity_evaluation to write files incrementally via callback
  rather than all at once after the batch; long runs are now resumable
  if interrupted
- Add incremental skip in CLI: entities with existing eval files are
  skipped automatically on re-run (acts as resume)
- Add eval-summary command: reads all eval files, shows per-dimension
  means, optionally writes per_entity_mean to metrics.yaml
- Fix record_check_results to merge rather than overwrite metrics.yaml
  so per_entity_mean survives subsequent check runs
- Add per_entity_mean viability threshold (min: 3.5) to infospace.yaml

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
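The incremental write and skip-on-re-run behaviour described in the commit message can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the project's code: `run_batch`, `pending`, `make_writer`, and the `<slug>.md` file layout are hypothetical names chosen for illustration. Only the per-entity callback and the "skip entities that already have an eval file" rule come from the message.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterable, List


def run_batch(slugs: Iterable[str],
              evaluate: Callable[[str], str],
              on_result: Callable[[str, str], None]) -> None:
    """Evaluate slugs one at a time and hand each result to the callback immediately."""
    for slug in slugs:
        on_result(slug, evaluate(slug))


def pending(slugs: Iterable[str], eval_dir: Path) -> List[str]:
    """Return slugs without an eval file yet (file naming is an assumption)."""
    return [s for s in slugs if not (eval_dir / f"{s}.md").exists()]


def make_writer(eval_dir: Path) -> Callable[[str, str], None]:
    """Build a callback that writes one eval file per entity as soon as it is ready."""
    eval_dir.mkdir(parents=True, exist_ok=True)

    def write(slug: str, text: str) -> None:
        (eval_dir / f"{slug}.md").write_text(text, encoding="utf-8")

    return write


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp)
        todo = pending(["division-of-labour", "market-price"], out)
        run_batch(todo, lambda slug: f"evaluation of {slug}", make_writer(out))
        # A second run finds both files on disk and has nothing left to do.
        print(pending(["division-of-labour", "market-price"], out))
```

Because each file is written before the next entity is evaluated, an interruption loses at most the entity in flight, and re-running with the same slug list picks up where the previous run stopped.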
@@ -37,6 +37,8 @@ viability:
     max: 0
   granularity_entropy:
     min: 1.0
+  per_entity_mean:
+    min: 3.5 # LLM quality score across 5 dimensions (1-5 scale)
 
 pipeline:
   stages:
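To illustrate how a `min`/`max` threshold such as the new `per_entity_mean` entry might be applied, here is a small sketch. The function `check_viability`, the dictionary shapes, and the choice to report a missing metric as a failure are all assumptions for illustration; the diff only shows the threshold values themselves.

```python
from typing import Dict, List

Thresholds = Dict[str, Dict[str, float]]


def check_viability(metrics: Dict[str, float], thresholds: Thresholds) -> List[str]:
    """Return one message per violated bound; an empty list means all bounds hold."""
    failures: List[str] = []
    for name, bounds in thresholds.items():
        if name not in metrics:
            # Assumption: a threshold without a recorded metric counts as a failure.
            failures.append(f"{name}: metric missing")
            continue
        value = metrics[name]
        if "min" in bounds and value < bounds["min"]:
            failures.append(f"{name}: {value} < min {bounds['min']}")
        if "max" in bounds and value > bounds["max"]:
            failures.append(f"{name}: {value} > max {bounds['max']}")
    return failures


# Threshold values taken from the hunk above; metric values from this commit's metrics.
thresholds: Thresholds = {
    "granularity_entropy": {"min": 1.0},
    "per_entity_mean": {"min": 3.5},
}
metrics = {"granularity_entropy": 2.674752, "per_entity_mean": 4.42}

print(check_viability(metrics, thresholds))  # prints []
```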
@@ -934,3 +934,29 @@
     concern: C1
   metadata:
     source: collection-checks
+- snapshot_id: 090bb961
+  created_at: '2026-02-23T00:22:25.818146+00:00'
+  schema_name: default
+  entity_count: 988
+  entity_evaluations: []
+  collection_metrics:
+  - name: coherence_components
+    value: 0.0
+    concern: C3
+  - name: consistency_cycles
+    value: 0.0
+    concern: C4
+  - name: coverage_ratio
+    value: 0.6190476190476191
+    concern: C2
+  - name: granularity_entropy
+    value: 2.6747519428200657
+    concern: C5
+  - name: modularity
+    value: 0.0
+    concern: C3
+  - name: redundancy_ratio
+    value: 0.006072874493927126
+    concern: C1
+  metadata:
+    source: collection-checks
@@ -1,6 +1,7 @@
 coherence_components: 0.0
 consistency_cycles: 0.0
-coverage_ratio: 0.442424
-granularity_entropy: 2.953326
+coverage_ratio: 0.619048
+granularity_entropy: 2.674752
 modularity: 0.0
-redundancy_ratio: 0.005877
+per_entity_mean: 4.42
+redundancy_ratio: 0.006073
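The summary and merge steps behind this metrics change can be sketched as below. The helper names (`per_dimension_means`, `per_entity_mean`, `merge_metrics`) and the definition of `per_entity_mean` as the mean of each entity's own mean score are assumptions; the commit message only states that eval-summary reports per-dimension means and that check results are merged into `metrics.yaml` instead of replacing it. Plain dictionaries stand in for the YAML files.

```python
from typing import Dict, Iterable, List

Scores = Dict[str, int]


def _mean(values: Iterable[float]) -> float:
    items = list(values)
    return sum(items) / len(items)


def per_dimension_means(evals: List[Scores]) -> Dict[str, float]:
    """Mean score per dimension across all evaluated entities."""
    dims = sorted({dim for scores in evals for dim in scores})
    return {d: round(_mean(s[d] for s in evals if d in s), 2) for d in dims}


def per_entity_mean(evals: List[Scores]) -> float:
    """Mean of the per-entity mean scores (assumed definition)."""
    return round(_mean(_mean(s.values()) for s in evals), 2)


def merge_metrics(existing: Dict[str, float], updates: Dict[str, float]) -> Dict[str, float]:
    """Overlay new values on the existing metrics so unrelated keys survive."""
    merged = dict(existing)
    merged.update(updates)
    return merged


# Hypothetical scores for two entities, for illustration only.
evals = [
    {"definition_precision": 5, "source_grounding": 4, "domain_placement": 4,
     "vsm_relevance": 3, "explanatory_value": 5},
    {"definition_precision": 4, "source_grounding": 4, "domain_placement": 5,
     "vsm_relevance": 4, "explanatory_value": 5},
]
print(per_dimension_means(evals))

# A later check run updates collection metrics without dropping per_entity_mean.
existing = {"coverage_ratio": 0.442424, "per_entity_mean": 4.42}
check_results = {"coverage_ratio": 0.619048, "granularity_entropy": 2.674752}
print(merge_metrics(existing, check_results))
```

With an overwrite instead of a merge, the second print would lose `per_entity_mean`, which is the failure mode the commit message describes.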
examples/infospace-with-history/templates/evaluate-entity.md (normal file, 70 lines added)
@@ -0,0 +1,70 @@
+# Evaluate Economic Entity
+
+You are a quality assessor evaluating a single economic entity extracted from
+Adam Smith's *The Wealth of Nations* and mapped to Stafford Beer's Viable
+System Model. Your task is to score the entity on five quality dimensions
+and produce a structured evaluation.
+
+## Entity Under Evaluation
+
+@{entity_content}
+
+## Source Chapter
+
+@{source_chapter}
+
+## VSM Framework Reference
+
+@{vsm_framework}
+
+## Quality Rubric
+
+@{quality_rubric}
+
+## Instructions
+
+1. Read the entity carefully, including its definition, source chapter,
+   context, economic domain, and any VSM mapping information provided.
+2. Locate the relevant passage in the source chapter to verify source grounding.
+3. Consult the VSM framework reference to assess VSM relevance.
+4. Score each dimension 1–5 using the rubric above. Use the full range:
+   reserve 5 for genuinely excellent entries and 1 for clear failures.
+5. For each dimension, write exactly one sentence justifying the score.
+6. Compute the overall score as the mean of the five dimension scores,
+   rounded to two decimal places.
+7. List any flags for issues that warrant attention (empty list if none).
+   Valid flags: `circular-definition`, `missing-citation`, `wrong-domain`,
+   `no-vsm-mapping`, `redundant-with-<slug>`, `overclaimed-strength`,
+   `underclaimed-strength`.
+
+## Output Format
+
+Output YAML front-matter (scores + flags) followed by a markdown section
+with per-dimension justifications. Do not include any other text outside
+this structure.
+
+```
+---
+entity: <slug of the entity, kebab-case>
+scores:
+  definition_precision: <1-5>
+  source_grounding: <1-5>
+  domain_placement: <1-5>
+  vsm_relevance: <1-5>
+  explanatory_value: <1-5>
+  overall: <mean rounded to 2 decimal places>
+flags: []
+---
+
+## Justifications
+
+**Definition Precision (<score>/5):** <one sentence>
+
+**Source Grounding (<score>/5):** <one sentence>
+
+**Domain Placement (<score>/5):** <one sentence>
+
+**VSM Relevance (<score>/5):** <one sentence>
+
+**Explanatory Value (<score>/5):** <one sentence>
+```
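The scoring rules in the template (five dimensions on a 1-5 scale, an overall mean rounded to two decimals, and a closed list of flags) lend themselves to a mechanical check on the model's output. The sketch below assumes the front-matter has already been parsed into Python values and that scores are integers; `validate`, `overall_score`, and `is_valid_flag` are hypothetical helpers, and the kebab-case pattern for `redundant-with-<slug>` is an assumption based on the template's slug description.

```python
import re
from typing import Dict, List

DIMENSIONS = (
    "definition_precision",
    "source_grounding",
    "domain_placement",
    "vsm_relevance",
    "explanatory_value",
)
FIXED_FLAGS = {
    "circular-definition",
    "missing-citation",
    "wrong-domain",
    "no-vsm-mapping",
    "overclaimed-strength",
    "underclaimed-strength",
}
# Assumed shape of the parameterised flag: "redundant-with-" plus a kebab-case slug.
REDUNDANT_FLAG = re.compile(r"redundant-with-[a-z0-9]+(?:-[a-z0-9]+)*")


def overall_score(scores: Dict[str, int]) -> float:
    """Mean of the five dimension scores, rounded to two decimal places."""
    return round(sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS), 2)


def is_valid_flag(flag: str) -> bool:
    return flag in FIXED_FLAGS or REDUNDANT_FLAG.fullmatch(flag) is not None


def validate(scores: Dict[str, int], overall: float, flags: List[str]) -> List[str]:
    """Return a list of problems found; an empty list means the evaluation is consistent."""
    problems: List[str] = []
    for dim in DIMENSIONS:
        value = scores.get(dim)
        if not isinstance(value, int) or not 1 <= value <= 5:
            problems.append(f"{dim}: expected integer 1-5, got {value!r}")
    if not problems and abs(overall - overall_score(scores)) > 0.005:
        problems.append(f"overall: expected {overall_score(scores)}, got {overall}")
    problems.extend(f"unknown flag: {f}" for f in flags if not is_valid_flag(f))
    return problems


# Hypothetical evaluation, for illustration only.
scores = {
    "definition_precision": 5,
    "source_grounding": 4,
    "domain_placement": 4,
    "vsm_relevance": 3,
    "explanatory_value": 5,
}
print(validate(scores, 4.2, ["missing-citation"]))  # prints []
```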