markitect-main

Author	SHA1	Message	Date
tegwick	c0615c2d50	feat(infospace,llm): stabilize free-tier eval workflow Some checks failed Test Suite / code-quality (push) Has been cancelled Details Test Suite / security-scan (push) Has been cancelled Details Test Suite / unit-tests (3.11) (push) Has been cancelled Details Test Suite / unit-tests (3.12) (push) Has been cancelled Details Test Suite / integration-tests (push) Has been cancelled Details Test Suite / e2e-tests (push) Has been cancelled Details Test Suite / performance-tests (push) Has been cancelled Details Test Suite / test-summary (push) Has been cancelled Details Five improvements that eliminate most of the agent-in-the-loop friction observed while closing out the 988-entity WoN evaluation (C.1): 1. Gemini adapter now retries on 429 + 5xx with exponential backoff (same pattern already used by OpenRouter/OpenAI). Removes the need for shell-level retry wrappers when hitting free-tier rate limits. 2. evaluate CLI prints the underlying error ("ERROR — HTTP 503 …") instead of a bare "ERROR", so agents don't have to drop into Python to diagnose transient failures. 3. --entity/--chapter now respect existing evaluation files by default (previously only the full-collection pass did). New --force flag opts into re-evaluation. Stops silently burning free-tier quota on re-runs of the same slug. 4. --entity accepts hyphenated slugs (matching entity filenames) and normalizes them to the underscore form used on disk. On a miss the CLI suggests near matches instead of a bare "not found". 5. eval-summary --update-metrics is no longer destructive: read_metrics_file/write_metrics_file preserve structured values (type_distribution) and don't flatten ints to floats. Fixes a silent data loss observed on every run. Bonus: the evaluator field in written evaluation frontmatter now falls back from run_config.model_name to the adapter's resolved model (or the model echoed back in the API response), so rows no longer show `evaluator: null` when --model is omitted. Tests: new tests/unit/llm/test_gemini.py covers retry behavior; tests/unit/infospace/test_history.py gains a round-trip test that pins the type_distribution / int-preservation invariants. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-22 00:51:00 +02:00
tegwick	d1f57272a4	feat(example): add L2 classifications for 823/988 WoN entities (S3.4) Batch classification via OpenRouter (claude-sonnet-4). 165 entities remain unclassified due to credit exhaustion; incremental skip means a follow-up run will complete them automatically. Type × VSM matrix (823 entities): S1 S2 S3 S3* S4 S5 Element 86 75 58 21 43 32 (315 total, 38%) Process 39 42 37 17 67 24 (226 total, 28%) Institution 4 12 30 24 . 52 (122 total, 15%) Principle 3 7 15 2 43 32 (102 total, 12%) Relation 2 14 5 5 22 10 (58 total, 7%) Matrix fill: 29/30 cells (Institution/S4 empty — expected) Metrics updated: type_entropy=2.0936, vsm_type_matrix_cells=29 Also: - BatchEvaluator gains delay_seconds param for rate-limited providers - classify CLI gains --rpm option (--rpm 10 for Gemini free tier) - history.write_metrics_file now handles non-float metric values (type_distribution is a dict, was crashing round()) - run_entity_classification forwards delay_seconds to BatchEvaluator - classify-links and graph commands added by user (entities --by-type, graph --format mermaid/dot, classify-links for Relation enrichment) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-23 12:49:11 +01:00
tegwick	7f1eecbdb2	feat(infospace): add eval-summary command and improve evaluate pipeline (S3.3) - Fix evaluate dimensions to match template file: definition_precision, source_grounding, domain_placement, vsm_relevance, explanatory_value (was domain_relevance, discipline_alignment, conceptual_clarity) - Add VSM background context to evaluation prompt so LLM can score vsm_relevance without macro injection - Fix model_name bug: was sending literal "default" to API (HTTP 400) - Refactor run_entity_evaluation to write files incrementally via callback rather than all at once after the batch — long runs are now resumable if interrupted - Add incremental skip in CLI: entities with existing eval files are skipped automatically on re-run (acts as resume) - Add eval-summary command: reads all eval files, shows per-dimension means, optionally writes per_entity_mean to metrics.yaml - Fix record_check_results to merge rather than overwrite metrics.yaml so per_entity_mean survives subsequent check runs - Add per_entity_mean viability threshold (min: 3.5) to infospace.yaml Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-23 01:26:45 +01:00
tegwick	ce7f78d57d	feat(infospace): add metrics history and viability tracking (S2.5) History module with snapshot creation from check results, metrics file I/O, auto-append to history after checks, date-based snapshot lookup, and metric trend extraction. CLI commands: history, history-diff. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 02:01:00 +01:00

4 Commits