From 34ed7a6fab601930ffbe2ff0e3f2aa323014ab04 Mon Sep 17 00:00:00 2001
From: tegwick
Date: Mon, 23 Feb 2026 05:33:11 +0100
Subject: [PATCH] =?UTF-8?q?docs(tutorial):=20update=20=C2=A78-9=20for=20ev?=
 =?UTF-8?q?al-summary=20command=20and=206/6=20viability?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Add eval-summary command documentation with dimension descriptions
- Document resumable evaluate (incremental skip on re-run)
- Fix --entity slug example to use underscores (not hyphens)
- Update viability output to show per_entity_mean as 6th threshold
- Add workflow note: check → eval-summary --update-metrics → viability

Co-Authored-By: Claude Sonnet 4.6
---
 examples/infospace-with-history/TUTORIAL.md | 62 ++++++++++++++++++---
 1 file changed, 54 insertions(+), 8 deletions(-)

diff --git a/examples/infospace-with-history/TUTORIAL.md b/examples/infospace-with-history/TUTORIAL.md
index 5144e44c..f5718f28 100644
--- a/examples/infospace-with-history/TUTORIAL.md
+++ b/examples/infospace-with-history/TUTORIAL.md
@@ -391,13 +391,54 @@ markitect infospace evaluate --provider openrouter
 # Evaluate entities from a specific chapter:
 markitect infospace evaluate --chapter book-1-chapter-05 --provider openrouter
 
-# Re-evaluate a single entity:
-markitect infospace evaluate --entity division-of-labour --provider openrouter
+# Re-evaluate a single entity (slugs use underscores):
+markitect infospace evaluate --entity division_of_labour --provider openrouter
 ```
 
-This runs the `evaluate-entity` prompt template against each entity,
-scoring dimensions like definition precision, source grounding, and
-VSM relevance. Results are written to `output/evaluations/`.
+The command is resumable: entities with existing evaluation files are
+skipped automatically. Re-run after an interruption and it picks up
+where it left off. Results are written incrementally to
+`output/evaluations/.md`.
+
+Each entity is scored on five dimensions (1–5 scale):
+- `definition_precision` — Is the definition precise and non-circular?
+- `source_grounding` — Is the entity grounded in the actual source text?
+- `domain_placement` — Is the economic domain assignment correct?
+- `vsm_relevance` — Does the entity map naturally to a VSM system (S1–S5)?
+- `explanatory_value` — Does the entity add genuine explanatory power?
+
+### Evaluation summary
+
+After the evaluation run completes, compute aggregate statistics:
+
+```bash
+# Show per-dimension means:
+markitect infospace eval-summary
+
+# Also write per_entity_mean to metrics.yaml for viability checks:
+markitect infospace eval-summary --update-metrics
+```
+
+Sample output (full corpus, 988 entities):
+
+```
+Evaluation summary — 988 entities evaluated
+
+  Dimension                      Mean
+  --------------------------------------
+  overall                        4.XX
+  definition_precision           4.XX
+  domain_placement               X.XX
+  explanatory_value              4.XX
+  source_grounding               4.XX
+  vsm_relevance                  3.XX
+
+  Range: X.XX – X.XX
+```
+
+`vsm_relevance` typically scores lower than the other dimensions —
+many WoN concepts are foundational economic ideas that don't map
+cleanly to a single VSM system. This is expected and informative.
 
 ### Collection-level checks (C1–C5)
 
@@ -459,15 +500,20 @@ Compares the latest metrics against the thresholds declared in
 ```
 Metric                    Value        Threshold      Status
 ---------------------------------------------------------------
-redundancy_ratio          0.0059       max=0.1        PASS
+redundancy_ratio          0.0061       max=0.1        PASS
 coverage_ratio            0.6190       min=0.4        PASS
 coherence_components      0.0000       max=3          PASS
 consistency_cycles        0.0000       max=0          PASS
-granularity_entropy       2.9533       min=1.0        PASS
+granularity_entropy       2.6748       min=1.0        PASS
+per_entity_mean           4.XXXX       min=3.5        PASS
 
-Viable: YES (5/5 thresholds met)
+Viable: YES (6/6 thresholds met)
 ```
 
+`per_entity_mean` only appears after running `eval-summary --update-metrics`.
+Run `check` first (deterministic), then `eval-summary --update-metrics`,
+then `viability` to see the full six-threshold dashboard.
+
 During early processing (first few books), coverage will fall and then
 stabilise as the domain × chapter matrix fills in. The threshold of
 0.40 reflects realistic expectations for a multi-book corpus where