feat(infospace,llm): stabilize free-tier eval workflow

Five improvements that eliminate most of the agent-in-the-loop friction
observed while closing out the 988-entity WoN evaluation (C.1):

1. Gemini adapter now retries on 429 + 5xx with exponential backoff
   (same pattern already used by OpenRouter/OpenAI). Removes the need
   for shell-level retry wrappers when hitting free-tier rate limits.

2. evaluate CLI prints the underlying error ("ERROR — HTTP 503 …")
   instead of a bare "ERROR", so agents don't have to drop into Python
   to diagnose transient failures.

3. --entity/--chapter now respect existing evaluation files by default
   (previously only the full-collection pass did). New --force flag
   opts into re-evaluation. Stops silently burning free-tier quota on
   re-runs of the same slug.

4. --entity accepts hyphenated slugs (matching entity filenames) and
   normalizes them to the underscore form used on disk. On a miss the
   CLI suggests near matches instead of a bare "not found".

5. eval-summary --update-metrics is no longer destructive:
   read_metrics_file/write_metrics_file preserve structured values
   (type_distribution) and don't flatten ints to floats. Fixes a silent
   data loss observed on every run.

Bonus: the evaluator field in written evaluation frontmatter now falls
back from run_config.model_name to the adapter's resolved model (or the
model echoed back in the API response), so rows no longer show
`evaluator: null` when --model is omitted.

Tests: new tests/unit/llm/test_gemini.py covers retry behavior;
tests/unit/infospace/test_history.py gains a round-trip test that pins
the type_distribution / int-preservation invariants.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
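
For readers unfamiliar with the retry pattern in item 1, here is a minimal sketch of exponential backoff on 429 and 5xx responses. It is not the adapter code from this commit: `HTTPStatusError`, `is_retryable`, `call_with_backoff`, and the delay defaults are all hypothetical names and values chosen for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class HTTPStatusError(Exception):
    """Hypothetical error type that carries an HTTP status code."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


def is_retryable(status: int) -> bool:
    # Rate limiting (429) and server-side failures (5xx) are treated as transient.
    return status == 429 or 500 <= status <= 599


def call_with_backoff(
    send: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call send(), retrying transient HTTP errors with doubling delays."""
    for attempt in range(max_attempts):
        try:
            return send()
        except HTTPStatusError as exc:
            out_of_attempts = attempt == max_attempts - 1
            if out_of_attempts or not is_retryable(exc.status):
                raise  # surface the underlying error to the caller
            # Delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise RuntimeError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter is a design choice of this sketch: it lets a unit test record the delays instead of actually waiting.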
2026-04-22 00:51:00 +02:00
parent 965508ec06
commit c0615c2d50
6 changed files with 210 additions and 27 deletions
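
The slug handling described in item 4 of the commit message can be sketched as follows. This is an illustrative approximation, not the code in this commit: `normalize_slug`, `resolve_entity`, and the example slugs are hypothetical, and using `difflib.get_close_matches` for the suggestions is an assumption about how near matches might be found.

```python
import difflib
from typing import List, Optional, Tuple


def normalize_slug(slug: str) -> str:
    # Hyphenated form (as in entity filenames) -> underscore form used on disk.
    return slug.strip().replace("-", "_")


def resolve_entity(
    slug: str, known_slugs: List[str]
) -> Tuple[Optional[str], List[str]]:
    """Return (matched_slug, suggestions); suggestions are filled only on a miss."""
    normalized = normalize_slug(slug)
    if normalized in known_slugs:
        return normalized, []
    # On a miss, offer up to three near matches, best match first.
    suggestions = difflib.get_close_matches(normalized, known_slugs, n=3, cutoff=0.6)
    return None, suggestions


# Hypothetical slugs, for illustration only.
known = ["division_of_labour", "invisible_hand", "corn_laws"]
found, _ = resolve_entity("division-of-labour", known)
missing, hints = resolve_entity("invisble-hand", known)
```

Here `found` is the underscore slug, while the misspelled lookup yields no match and a suggestion list led by the closest known slug.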
--- a/markitect/infospace/evaluate.py
+++ b/markitect/infospace/evaluate.py
@@ -195,12 +195,23 @@ def run_entity_evaluation(
    """
    topic = config.topic.name
    evaluations_path = output_dir or Path(config.evaluations_dir)
-    evaluator_name = (run_config.model_name if run_config else "unknown")
+    # Fall back from run_config.model_name (may be None if the CLI user did
+    # not pass --model) to the adapter's resolved model, and only then to
+    # "unknown". Keeps the evaluator field in the written frontmatter
+    # informative for later audits.
+    default_evaluator = (
+        (run_config.model_name if run_config else None)
+        or getattr(adapter, "_model", None)
+        or "unknown"
+    )

    def _write_and_notify(done: int, total: int, result) -> None:
        # Write file immediately on success (incremental — run is resumable)
        if result.status == "success" and result.response is not None:
            scores = parse_evaluation_response(result.response.content, dimensions)
+            # Prefer the model name the adapter actually echoed back — it
+            # reflects post-resolution fallbacks (e.g. flash → flash-lite).
+            evaluator_name = result.response.model or default_evaluator
            evaluation = EntityEvaluation(
                entity_slug=result.key,
                evaluator=evaluator_name,