feat(infospace): add eval-summary command and improve evaluate pipeline (S3.3)

- Fix evaluate dimensions to match template file:
  definition_precision, source_grounding, domain_placement,
  vsm_relevance, explanatory_value (was domain_relevance,
  discipline_alignment, conceptual_clarity)
- Add VSM background context to evaluation prompt so LLM can
  score vsm_relevance without macro injection
- Fix model_name bug: was sending literal "default" to API (HTTP 400)
- Refactor run_entity_evaluation to write files incrementally via
  callback rather than all at once after the batch, so long runs are
  now resumable if interrupted
- Add incremental skip in CLI: entities with existing eval files
  are skipped automatically on re-run (acts as resume)
- Add eval-summary command: reads all eval files, shows per-dimension
  means, optionally writes per_entity_mean to metrics.yaml
- Fix record_check_results to merge rather than overwrite metrics.yaml
  so per_entity_mean survives subsequent check runs
- Add per_entity_mean viability threshold (min: 3.5) to infospace.yaml

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-23 01:26:45 +01:00
parent 574bb11db6
commit 7f1eecbdb2
7 changed files with 242 additions and 42 deletions
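The resumable behaviour described in the commit message (write each result through a callback, skip entities that already have an eval file) can be sketched roughly as follows. This is a minimal illustration, not the project's code: `pending_entities`, `run_batch`, `make_writer`, and the `<slug>.md` file naming are all hypothetical.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def pending_entities(slugs: Iterable[str], eval_dir: Path) -> List[str]:
    """Return the slugs that do not yet have an eval file (the 'skip' step)."""
    # Assumption: one eval file per entity, named <slug>.md.
    return [s for s in slugs if not (eval_dir / f"{s}.md").exists()]


def make_writer(eval_dir: Path) -> Callable[[str, str], None]:
    """Build a callback that persists one evaluation as soon as it is ready."""
    eval_dir.mkdir(parents=True, exist_ok=True)

    def write(slug: str, text: str) -> None:
        (eval_dir / f"{slug}.md").write_text(text, encoding="utf-8")

    return write


def run_batch(
    slugs: Iterable[str],
    evaluate: Callable[[str], str],
    on_result: Callable[[str, str], None],
) -> int:
    """Evaluate each slug and hand the result to the callback immediately.

    If evaluate() raises midway, everything written so far stays on disk,
    so a re-run only has to process what pending_entities() still returns.
    """
    done = 0
    for slug in slugs:
        on_result(slug, evaluate(slug))
        done += 1
    return done
```

A re-run would call `run_batch(pending_entities(all_slugs, eval_dir), evaluate, make_writer(eval_dir))`, which is what makes the skip double as a resume.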


@@ -37,6 +37,8 @@ viability:
 max: 0
 granularity_entropy:
 min: 1.0
+per_entity_mean:
+min: 3.5 # LLM quality score across 5 dimensions (1-5 scale)
 pipeline:
 stages:
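To illustrate how a `min`/`max` threshold such as `per_entity_mean: min: 3.5` might be applied, here is a small sketch. The function name `check_thresholds` and the dict shapes are assumptions for illustration; the project's actual viability check is not shown in this diff.

```python
from typing import Dict, Optional


def check_thresholds(
    metrics: Dict[str, float],
    viability: Dict[str, Dict[str, float]],
) -> Dict[str, Optional[bool]]:
    """Map each thresholded metric to True (passes), False (fails),
    or None (no value recorded yet, e.g. eval-summary has not run)."""
    results: Dict[str, Optional[bool]] = {}
    for name, bounds in viability.items():
        value = metrics.get(name)
        if value is None:
            results[name] = None
            continue
        ok = True
        if "min" in bounds and value < bounds["min"]:
            ok = False
        if "max" in bounds and value > bounds["max"]:
            ok = False
        results[name] = ok
    return results


# Thresholds and values taken from the hunks in this commit.
viability = {
    "granularity_entropy": {"min": 1.0},
    "per_entity_mean": {"min": 3.5},
}
metrics = {"granularity_entropy": 2.674752, "per_entity_mean": 4.42}
status = check_thresholds(metrics, viability)
```

Returning `None` for a missing metric keeps "not yet measured" distinct from "failed", which matters here because `per_entity_mean` only exists after an evaluation run.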


@@ -934,3 +934,29 @@
 concern: C1
 metadata:
 source: collection-checks
+- snapshot_id: 090bb961
+created_at: '2026-02-23T00:22:25.818146+00:00'
+schema_name: default
+entity_count: 988
+entity_evaluations: []
+collection_metrics:
+- name: coherence_components
+value: 0.0
+concern: C3
+- name: consistency_cycles
+value: 0.0
+concern: C4
+- name: coverage_ratio
+value: 0.6190476190476191
+concern: C2
+- name: granularity_entropy
+value: 2.6747519428200657
+concern: C5
+- name: modularity
+value: 0.0
+concern: C3
+- name: redundancy_ratio
+value: 0.006072874493927126
+concern: C1
+metadata:
+source: collection-checks


@@ -1,6 +1,7 @@
 coherence_components: 0.0
 consistency_cycles: 0.0
-coverage_ratio: 0.442424
-granularity_entropy: 2.953326
+coverage_ratio: 0.619048
+granularity_entropy: 2.674752
 modularity: 0.0
-redundancy_ratio: 0.005877
+per_entity_mean: 4.42
+redundancy_ratio: 0.006073
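The merge-rather-than-overwrite fix from the commit message is what lets `per_entity_mean` sit alongside the collection-check values in this file. A rough sketch of the idea follows, using only the standard library. The flat `name: number` parser stands in for a real YAML loader, and `load_flat_metrics` and `record_metrics` are hypothetical names, not the project's `record_check_results`.

```python
from pathlib import Path
from typing import Dict


def load_flat_metrics(path: Path) -> Dict[str, float]:
    """Parse a flat `name: number` file; a stand-in for a YAML loader."""
    metrics: Dict[str, float] = {}
    if not path.exists():
        return metrics
    for line in path.read_text(encoding="utf-8").splitlines():
        if ":" in line:
            name, _, raw = line.partition(":")
            metrics[name.strip()] = float(raw)
    return metrics


def record_metrics(path: Path, new_values: Dict[str, float]) -> Dict[str, float]:
    """Merge new values into the file instead of replacing its contents.

    Keys already on disk but absent from new_values are preserved;
    keys present in both take the new value.
    """
    merged = load_flat_metrics(path)
    merged.update(new_values)
    # Sorted keys are an assumption, chosen to match the order seen in the diff.
    lines = [f"{name}: {merged[name]}" for name in sorted(merged)]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return merged
```

With this shape, a check run that only knows about `coverage_ratio` and friends leaves a previously written `per_entity_mean` untouched.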


@@ -0,0 +1,70 @@
# Evaluate Economic Entity
You are a quality assessor evaluating a single economic entity extracted from
Adam Smith's *The Wealth of Nations* and mapped to Stafford Beer's Viable
System Model. Your task is to score the entity on five quality dimensions
and produce a structured evaluation.
## Entity Under Evaluation
@{entity_content}
## Source Chapter
@{source_chapter}
## VSM Framework Reference
@{vsm_framework}
## Quality Rubric
@{quality_rubric}
## Instructions
1. Read the entity carefully, including its definition, source chapter,
context, economic domain, and any VSM mapping information provided.
2. Locate the relevant passage in the source chapter to verify source grounding.
3. Consult the VSM framework reference to assess VSM relevance.
4. Score each dimension 1-5 using the rubric above. Use the full range:
reserve 5 for genuinely excellent entries and 1 for clear failures.
5. For each dimension, write exactly one sentence justifying the score.
6. Compute the overall score as the mean of the five dimension scores,
rounded to two decimal places.
7. List any flags for issues that warrant attention (empty list if none).
Valid flags: `circular-definition`, `missing-citation`, `wrong-domain`,
`no-vsm-mapping`, `redundant-with-<slug>`, `overclaimed-strength`,
`underclaimed-strength`.
## Output Format
Output YAML front-matter (scores + flags) followed by a markdown section
with per-dimension justifications. Do not include any other text outside
this structure.
```
---
entity: <slug of the entity, kebab-case>
scores:
definition_precision: <1-5>
source_grounding: <1-5>
domain_placement: <1-5>
vsm_relevance: <1-5>
explanatory_value: <1-5>
overall: <mean rounded to 2 decimal places>
flags: []
---
## Justifications
**Definition Precision (<score>/5):** <one sentence>
**Source Grounding (<score>/5):** <one sentence>
**Domain Placement (<score>/5):** <one sentence>
**VSM Relevance (<score>/5):** <one sentence>
**Explanatory Value (<score>/5):** <one sentence>
```
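As an illustration of how output in this format could be consumed, the sketch below pulls the five dimension scores out of the front-matter, recomputes the overall score (mean rounded to two decimals, as instruction 6 asks), and aggregates per-dimension means across entities. It uses a regex instead of a YAML parser to stay dependency-free. The function names are hypothetical, and treating `per_entity_mean` as the mean of per-entity overall scores is an assumption; the commit does not show how eval-summary computes it.

```python
import re
from statistics import mean
from typing import Dict, List

DIMENSIONS = (
    "definition_precision",
    "source_grounding",
    "domain_placement",
    "vsm_relevance",
    "explanatory_value",
)


def extract_scores(text: str) -> Dict[str, int]:
    """Read the five 1-5 scores from the front-matter between the --- lines."""
    parts = text.split("---")
    front_matter = parts[1] if len(parts) >= 3 else ""
    scores: Dict[str, int] = {}
    for dim in DIMENSIONS:
        match = re.search(rf"^\s*{dim}:\s*([1-5])\s*$", front_matter, re.MULTILINE)
        if match:
            scores[dim] = int(match.group(1))
    return scores


def overall(scores: Dict[str, int]) -> float:
    """Mean of the five dimension scores, rounded to two decimal places."""
    return round(float(mean(scores[d] for d in DIMENSIONS)), 2)


def summarize(evals: List[Dict[str, int]]) -> Dict[str, float]:
    """Per-dimension means plus an assumed per_entity_mean (mean of overalls)."""
    summary = {d: round(float(mean(e[d] for e in evals)), 2) for d in DIMENSIONS}
    summary["per_entity_mean"] = round(mean(overall(e) for e in evals), 2)
    return summary
```

Recomputing `overall` from the dimension scores, rather than trusting the value the model wrote, guards against arithmetic slips in the LLM output.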