tegwick 7f1eecbdb2 feat(infospace): add eval-summary command and improve evaluate pipeline (S3.3)

- Fix evaluate dimensions to match template file:
  definition_precision, source_grounding, domain_placement,
  vsm_relevance, explanatory_value (was domain_relevance,
  discipline_alignment, conceptual_clarity)
- Add VSM background context to evaluation prompt so LLM can
  score vsm_relevance without macro injection
- Fix model_name bug: was sending literal "default" to API (HTTP 400)
- Refactor run_entity_evaluation to write files incrementally via
  callback rather than all at once after the batch — long runs are
  now resumable if interrupted
- Add incremental skip in CLI: entities with existing eval files
  are skipped automatically on re-run (acts as resume)
- Add eval-summary command: reads all eval files, shows per-dimension
  means, optionally writes per_entity_mean to metrics.yaml
- Fix record_check_results to merge rather than overwrite metrics.yaml
  so per_entity_mean survives subsequent check runs
- Add per_entity_mean viability threshold (min: 3.5) to infospace.yaml

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-02-23 01:26:45 +01:00
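
The resume and merge behaviour described in the commit message could look roughly like the sketch below. It is an illustration under stated assumptions, not the repository's code: `pending_entities`, `run_batch`, `merge_metrics`, the `<slug>.md` file layout, and the `evaluate` callable are all hypothetical names chosen for the example.

```python
from pathlib import Path
from typing import Callable, Iterable


def pending_entities(slugs: Iterable[str], eval_dir: Path) -> list[str]:
    """Return slugs without an eval file yet (assumes a <slug>.md layout)."""
    return [s for s in slugs if not (eval_dir / f"{s}.md").exists()]


def run_batch(slugs: Iterable[str], eval_dir: Path,
              evaluate: Callable[[str], str]) -> int:
    """Evaluate pending entities, writing each file as soon as it is ready.

    Because every result is written immediately, an interrupted run can be
    restarted and will only process the entities that are still missing.
    """
    eval_dir.mkdir(parents=True, exist_ok=True)
    written = 0
    for slug in pending_entities(slugs, eval_dir):
        (eval_dir / f"{slug}.md").write_text(evaluate(slug), encoding="utf-8")
        written += 1
    return written


def merge_metrics(existing: dict, update: dict) -> dict:
    """Recursively merge `update` into a copy of `existing`.

    Keys that `update` does not mention (for example per_entity_mean)
    survive, which is the difference between merging and overwriting.
    """
    merged = dict(existing)
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_metrics(merged[key], value)
        else:
            merged[key] = value
    return merged
```

Serialising the merged dictionary back to `metrics.yaml` is left out here, since it depends on whichever YAML library the project uses.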

2.0 KiB

Evaluate Economic Entity

You are a quality assessor evaluating a single economic entity extracted from Adam Smith's The Wealth of Nations and mapped to Stafford Beer's Viable System Model. Your task is to score the entity on five quality dimensions and produce a structured evaluation.

Entity Under Evaluation

@{entity_content}

Source Chapter

@{source_chapter}

VSM Framework Reference

@{vsm_framework}

Quality Rubric

@{quality_rubric}

Instructions

1. Read the entity carefully, including its definition, source chapter, context, economic domain, and any VSM mapping information provided.
2. Locate the relevant passage in the source chapter to verify source grounding.
3. Consult the VSM framework reference to assess VSM relevance.
4. Score each dimension 1–5 using the rubric above. Use the full range: reserve 5 for genuinely excellent entries and 1 for clear failures.
5. For each dimension, write exactly one sentence justifying the score.
6. Compute the overall score as the mean of the five dimension scores, rounded to two decimal places.
7. List any flags for issues that warrant attention (empty list if none). Valid flags: circular-definition, missing-citation, wrong-domain, no-vsm-mapping, redundant-with-<slug>, overclaimed-strength, underclaimed-strength.
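
For reference, the overall-score arithmetic and the flag vocabulary above can be checked mechanically. The sketch below is one possible way to do that. The function names are hypothetical, and the pattern for `redundant-with-<slug>` assumes slugs are lowercase kebab-case, as in the output format.

```python
import re

DIMENSIONS = (
    "definition_precision",
    "source_grounding",
    "domain_placement",
    "vsm_relevance",
    "explanatory_value",
)

FIXED_FLAGS = {
    "circular-definition",
    "missing-citation",
    "wrong-domain",
    "no-vsm-mapping",
    "overclaimed-strength",
    "underclaimed-strength",
}

# Assumption: <slug> is lowercase kebab-case.
REDUNDANT_FLAG = re.compile(r"redundant-with-[a-z0-9]+(?:-[a-z0-9]+)*")


def overall_score(scores: dict[str, int]) -> float:
    """Mean of the five dimension scores, rounded to two decimal places."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    values = [scores[d] for d in DIMENSIONS]
    if any(not 1 <= v <= 5 for v in values):
        raise ValueError("every score must be between 1 and 5")
    return round(sum(values) / len(values), 2)


def invalid_flags(flags: list[str]) -> list[str]:
    """Return the flags that are not part of the allowed vocabulary."""
    return [
        f for f in flags
        if f not in FIXED_FLAGS and not REDUNDANT_FLAG.fullmatch(f)
    ]
```

With scores of 4, 5, 3, 4, and 3 the sum is 19, so the overall score is 19 / 5 = 3.8.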

Output Format

Output YAML front matter (entity slug, scores, overall, and flags) followed by a Markdown section with per-dimension justifications. Do not include any text outside this structure.

---
entity: <slug of the entity, kebab-case>
scores:
  definition_precision: <1-5>
  source_grounding: <1-5>
  domain_placement: <1-5>
  vsm_relevance: <1-5>
  explanatory_value: <1-5>
overall: <mean rounded to 2 decimal places>
flags: []
---

## Justifications

**Definition Precision (<score>/5):** <one sentence>

**Source Grounding (<score>/5):** <one sentence>

**Domain Placement (<score>/5):** <one sentence>

**VSM Relevance (<score>/5):** <one sentence>

**Explanatory Value (<score>/5):** <one sentence>
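
A consumer such as the eval-summary command has to split a response in this format back into its fields. The sketch below shows one way to do so using only the standard library. It is a hypothetical, deliberately narrow parser that understands just the layout shown above (unquoted values, nested `scores`, inline or block-style `flags`). A real pipeline would more likely hand the front matter to a YAML library.

```python
import re


def parse_evaluation(text: str) -> dict:
    """Split an evaluation into front-matter fields and the Markdown body."""
    match = re.match(r"\s*---\n(.*?)\n---\n?(.*)", text, re.DOTALL)
    if match is None:
        raise ValueError("missing front-matter delimiters")
    front, body = match.groups()

    result = {"scores": {}, "flags": [], "body": body.strip()}
    section = None  # name of the most recent top-level key
    for line in front.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("- "):
            # Block-style list item; only meaningful under `flags:`.
            if section == "flags":
                result["flags"].append(stripped[2:].strip())
            continue
        key, _, value = stripped.partition(":")
        value = value.strip()
        if line[0].isspace():
            # Indented key/value pairs belong to the `scores:` mapping.
            if section == "scores":
                result["scores"][key] = int(value)
            continue
        section = key
        if key == "entity":
            result["entity"] = value
        elif key == "overall":
            result["overall"] = float(value)
        elif key == "flags":
            # Inline list such as `[]` or `[a, b]`.
            result["flags"] = [
                f.strip() for f in value.strip("[]").split(",") if f.strip()
            ]
    return result
```

Per-dimension means can then be computed by collecting `result["scores"]` across all parsed files.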
