feat(example): add per-entity LLM evaluations for 985 WoN entities (S3.3)

Batch evaluation of all 988 entities via OpenRouter. 984 succeeded on
first pass; 3 failed (network errors). eval-summary --update-metrics
written with per_entity_mean=3.9556.

Viability dashboard: 6/6 PASS
  redundancy_ratio   0.0061  (max 0.10)
  coverage_ratio     0.6190  (min 0.40)
  coherence_comps    0.0000  (max 3)
  consistency_cycles 0.0000  (max 0)
  granularity_entropy 2.6748 (min 1.0)
  per_entity_mean    3.9556  (min 3.5)

Dimension breakdown (mean across 985 entities):
  definition_precision  3.62
  source_grounding      4.36
  domain_placement      4.56
  vsm_relevance         3.31
  explanatory_value     3.94

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-23 09:36:46 +01:00
parent 81a4c8796a
commit a9ca0adfcf
986 changed files with 63216 additions and 1 deletions
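
The viability dashboard in the commit message amounts to six bounded checks. A minimal sketch of that gate follows, using the values and bounds listed there. The `DASHBOARD` layout and the `check` helper are hypothetical, and treating every bound as inclusive is an assumption (it is what lets `consistency_cycles` pass with a value of 0 against a max of 0); the project's actual `eval-summary` logic is not shown in this commit.

```python
# Values and bounds copied from the dashboard in the commit message.
# The (value, kind, bound) layout and check() are hypothetical illustrations.
DASHBOARD = {
    "redundancy_ratio": (0.0061, "max", 0.10),
    "coverage_ratio": (0.6190, "min", 0.40),
    "coherence_comps": (0.0, "max", 3),
    "consistency_cycles": (0.0, "max", 0),
    "granularity_entropy": (2.6748, "min", 1.0),
    "per_entity_mean": (3.9556, "min", 3.5),
}


def check(value, kind, bound):
    """Return True when value respects an inclusive min or max bound (assumed)."""
    return value <= bound if kind == "max" else value >= bound


results = {name: check(*spec) for name, spec in DASHBOARD.items()}
passed = sum(results.values())
print(f"{passed}/{len(results)} PASS")
```

With the numbers above, every check holds, which matches the reported 6/6.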


@@ -0,0 +1,63 @@
---
entity_slug: fraud_in_drawback_system
evaluator: null
evaluated_at: '2026-02-23T05:31:00.842843'
overall_score: 2.4
scores:
- name: definition_precision
  value: 1.0
  max_value: 5.0
  rationale: There is no definition provided at all, making it impossible to assess
    precision or distinctness. The entity exists only as a title without any conceptual
    content.
- name: source_grounding
  value: 2.0
  max_value: 5.0
  rationale: While Smith does discuss drawbacks (export bounties/refunds) and mentions
    potential for abuse in tax systems, the specific framing as "fraud in drawback
    system" may not be explicitly articulated as a distinct concept in the source
    text. Without seeing the actual definition and context, it's unclear if this represents
    Smith's own conceptualization.
- name: domain_placement
  value: 3.0
  max_value: 5.0
  rationale: The concept would logically belong in public finance or trade policy
    domains, which are central to Smith's work, but without a specified domain or
    definition, proper placement cannot be confirmed. The economic relevance is apparent
    but underspecified.
- name: vsm_relevance
  value: 4.0
  max_value: 5.0
  rationale: This entity would map well to S3 (internal regulation/audit) as it concerns
    detecting and preventing abuse within government financial systems. The concept
    has clear VSM relevance for control and monitoring functions.
- name: explanatory_value
  value: 2.0
  max_value: 5.0
  rationale: While fraud in tax/trade systems could illuminate important structural
    weaknesses in government finance, without any definition or context provided,
    this entity currently offers no explanatory power beyond naming a potential problem
    area. It remains a surface-level label rather than an analytical concept.
---
# Evaluation: Fraud In Drawback System
## definition_precision — 1.0 / 5.0
There is no definition provided at all, making it impossible to assess precision or distinctness. The entity exists only as a title without any conceptual content.
## source_grounding — 2.0 / 5.0
While Smith does discuss drawbacks (export bounties/refunds) and mentions potential for abuse in tax systems, the specific framing as "fraud in drawback system" may not be explicitly articulated as a distinct concept in the source text. Without seeing the actual definition and context, it's unclear if this represents Smith's own conceptualization.
## domain_placement — 3.0 / 5.0
The concept would logically belong in public finance or trade policy domains, which are central to Smith's work, but without a specified domain or definition, proper placement cannot be confirmed. The economic relevance is apparent but underspecified.
## vsm_relevance — 4.0 / 5.0
This entity would map well to S3 (internal regulation/audit) as it concerns detecting and preventing abuse within government financial systems. The concept has clear VSM relevance for control and monitoring functions.
## explanatory_value — 2.0 / 5.0
While fraud in tax/trade systems could illuminate important structural weaknesses in government finance, without any definition or context provided, this entity currently offers no explanatory power beyond naming a potential problem area. It remains a surface-level label rather than an analytical concept.
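
The `overall_score` of 2.4 in the front matter is consistent with an unweighted mean of the five dimension values. The sketch below checks that arithmetic; the averaging rule is an assumption inferred from the numbers, since the evaluator's actual formula is not shown in this file.

```python
from statistics import mean

# Dimension values copied from the front matter of this evaluation.
scores = {
    "definition_precision": 1.0,
    "source_grounding": 2.0,
    "domain_placement": 3.0,
    "vsm_relevance": 4.0,
    "explanatory_value": 2.0,
}

# Assumption: overall_score is the unweighted mean of the dimension values,
# rounded to one decimal place.
overall = round(mean(scores.values()), 1)
print(overall)
```

Here the five values sum to 12.0, so the mean over five dimensions is 2.4, matching the recorded score.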