docs(tutorial): update §8-9 for eval-summary command and 6/6 viability

- Add eval-summary command documentation with dimension descriptions
- Document resumable evaluate (incremental skip on re-run)
- Fix --entity slug example to use underscores (not hyphens)
- Update viability output to show per_entity_mean as 6th threshold
- Add workflow note: check → eval-summary --update-metrics → viability

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -391,13 +391,54 @@ markitect infospace evaluate --provider openrouter
 # Evaluate entities from a specific chapter:
 markitect infospace evaluate --chapter book-1-chapter-05 --provider openrouter
 
-# Re-evaluate a single entity:
-markitect infospace evaluate --entity division-of-labour --provider openrouter
+# Re-evaluate a single entity (slugs use underscores):
+markitect infospace evaluate --entity division_of_labour --provider openrouter
 ```
 
-This runs the `evaluate-entity` prompt template against each entity,
-scoring dimensions like definition precision, source grounding, and
-VSM relevance. Results are written to `output/evaluations/`.
+The command is resumable: entities with existing evaluation files are
+skipped automatically. Re-run after an interruption and it picks up
+where it left off. Results are written incrementally to
+`output/evaluations/<slug>.md`.
+
+Each entity is scored on five dimensions (1–5 scale):
+- `definition_precision` — Is the definition precise and non-circular?
+- `source_grounding` — Is the entity grounded in the actual source text?
+- `domain_placement` — Is the economic domain assignment correct?
+- `vsm_relevance` — Does the entity map naturally to a VSM system (S1–S5)?
+- `explanatory_value` — Does the entity add genuine explanatory power?
+
+### Evaluation summary
+
+After the evaluation run completes, compute aggregate statistics:
+
+```bash
+# Show per-dimension means:
+markitect infospace eval-summary
+
+# Also write per_entity_mean to metrics.yaml for viability checks:
+markitect infospace eval-summary --update-metrics
+```
+
+Sample output (full corpus, 988 entities):
+
+```
+Evaluation summary — 988 entities evaluated
+
+Dimension                         Mean
+--------------------------------------
+overall                           4.XX
+definition_precision              4.XX
+domain_placement                  X.XX
+explanatory_value                 4.XX
+source_grounding                  4.XX
+vsm_relevance                     3.XX
+
+Range: X.XX – X.XX
+```
+
+`vsm_relevance` typically scores lower than the other dimensions —
+many WoN concepts are foundational economic ideas that don't map
+cleanly to a single VSM system. This is expected and informative.
 
 ### Collection-level checks (C1–C5)
 
@@ -459,15 +500,20 @@ Compares the latest metrics against the thresholds declared in
 ```
 Metric                      Value       Threshold        Status
 ---------------------------------------------------------------
-redundancy_ratio            0.0059      max=0.1          PASS
+redundancy_ratio            0.0061      max=0.1          PASS
 coverage_ratio              0.6190      min=0.4          PASS
 coherence_components        0.0000      max=3            PASS
 consistency_cycles          0.0000      max=0            PASS
-granularity_entropy         2.9533      min=1.0          PASS
+granularity_entropy         2.6748      min=1.0          PASS
+per_entity_mean             4.XXXX      min=3.5          PASS
 
-Viable: YES (5/5 thresholds met)
+Viable: YES (6/6 thresholds met)
 ```
 
+`per_entity_mean` only appears after running `eval-summary --update-metrics`.
+Run `check` first (deterministic), then `eval-summary --update-metrics`,
+then `viability` to see the full six-threshold dashboard.
+
 During early processing (first few books), coverage will fall and
 then stabilise as the domain × chapter matrix fills in. The threshold
 of 0.40 reflects realistic expectations for a multi-book corpus where
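
Reviewer note: the six-threshold comparison documented in this change could be sketched roughly as follows. This is a minimal illustration, not markitect's actual implementation. The `Threshold` class, the `check_viability` function, and the rule that a metric missing from the input is simply not counted are all assumptions made for this example. The threshold bounds and the five deterministic values are copied from the sample viability output; the `per_entity_mean` value of 4.0 is purely illustrative, since the tutorial leaves it as a placeholder.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Threshold:
    """Hypothetical bound on one metric: a lower limit, an upper limit, or both."""
    lower: Optional[float] = None
    upper: Optional[float] = None

    def passes(self, value: float) -> bool:
        # A value passes when it violates neither limit (limits are inclusive).
        if self.lower is not None and value < self.lower:
            return False
        if self.upper is not None and value > self.upper:
            return False
        return True


# Bounds copied from the sample viability output in the tutorial.
THRESHOLDS: Dict[str, Threshold] = {
    "redundancy_ratio": Threshold(upper=0.1),
    "coverage_ratio": Threshold(lower=0.4),
    "coherence_components": Threshold(upper=3),
    "consistency_cycles": Threshold(upper=0),
    "granularity_entropy": Threshold(lower=1.0),
    "per_entity_mean": Threshold(lower=3.5),
}


def check_viability(metrics: Dict[str, float]) -> Tuple[int, int]:
    """Return (thresholds met, thresholds checked).

    Assumption: a metric absent from `metrics` is skipped rather than failed,
    which is one way per_entity_mean could be absent until it has been written.
    """
    met = 0
    total = 0
    for name, threshold in THRESHOLDS.items():
        if name not in metrics:
            continue
        total += 1
        if threshold.passes(metrics[name]):
            met += 1
    return met, total


def report(metrics: Dict[str, float]) -> str:
    met, total = check_viability(metrics)
    verdict = "YES" if met == total else "NO"
    return f"Viable: {verdict} ({met}/{total} thresholds met)"


# Values from the sample output, before any evaluation summary is recorded.
deterministic = {
    "redundancy_ratio": 0.0061,
    "coverage_ratio": 0.6190,
    "coherence_components": 0.0,
    "consistency_cycles": 0.0,
    "granularity_entropy": 2.6748,
}

# per_entity_mean = 4.0 is an illustrative value, not a measured result.
with_eval = {**deterministic, "per_entity_mean": 4.0}

print(report(deterministic))
print(report(with_eval))
```

Under these assumptions the first call reports five thresholds and the second reports six, which mirrors the documented ordering: run `check`, then `eval-summary --update-metrics`, then `viability`.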