feat(infospace,llm): stabilize free-tier eval workflow

Five improvements that eliminate most of the agent-in-the-loop friction
observed while closing out the 988-entity WoN evaluation (C.1):

1. Gemini adapter now retries on 429 + 5xx with exponential backoff
   (same pattern already used by OpenRouter/OpenAI). Removes the need
   for shell-level retry wrappers when hitting free-tier rate limits.

2. evaluate CLI prints the underlying error ("ERROR — HTTP 503 …")
   instead of a bare "ERROR", so agents don't have to drop into Python
   to diagnose transient failures.

3. --entity/--chapter now respect existing evaluation files by default
   (previously only the full-collection pass did). New --force flag
   opts into re-evaluation. Stops silently burning free-tier quota on
   re-runs of the same slug.

4. --entity accepts hyphenated slugs (matching entity filenames) and
   normalizes them to the underscore form used on disk. On a miss the
   CLI suggests near matches instead of a bare "not found".

5. eval-summary --update-metrics is no longer destructive:
   read_metrics_file/write_metrics_file preserve structured values
   (type_distribution) and don't flatten ints to floats. Fixes a
   silent data loss observed on every run.

Bonus: the evaluator field in written evaluation frontmatter now falls
back from run_config.model_name to the adapter's resolved model (or the
model echoed back in the API response), so rows no longer show
`evaluator: null` when --model is omitted.

Tests: new tests/unit/llm/test_gemini.py covers retry behavior;
tests/unit/infospace/test_history.py gains a round-trip test that pins
the type_distribution / int-preservation invariants.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-22 00:51:00 +02:00
parent 965508ec06
commit c0615c2d50
6 changed files with 210 additions and 27 deletions
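
The retry behavior described in item 1 of the commit message is not part of the diff excerpt shown here. As a rough illustration only, retrying on 429 and 5xx with exponential backoff might look like the sketch below. The names `is_retryable` and `call_with_backoff`, the `send` callable, and the default delays are all hypothetical and are not taken from the Gemini adapter.

```python
import time
from typing import Callable, Tuple


def is_retryable(status: int) -> bool:
    # Assumption: 429 (rate limited) and any 5xx count as transient.
    return status == 429 or 500 <= status <= 599


def call_with_backoff(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() until it returns a non-retryable status or attempts run out.

    send is a hypothetical zero-argument callable returning (status, body).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    status, body = send()
    for attempt in range(max_attempts - 1):
        if not is_retryable(status):
            break
        # Delay doubles after each failed attempt, capped at max_delay.
        sleep(min(max_delay, base_delay * 2 ** attempt))
        status, body = send()
    return status, body
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting. A production version would probably add jitter and honor a `Retry-After` header when the server sends one.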
--- a/markitect/infospace/history.py
+++ b/markitect/infospace/history.py
@@ -81,17 +81,26 @@ def snapshot_from_checks(
 # ── Metrics file I/O ────────────────────────────────────────────────


-def write_metrics_file(metrics: Dict[str, float], path: Path) -> None:
+def write_metrics_file(metrics: Dict[str, Any], path: Path) -> None:
    """Write the latest metrics to a simple YAML file.

    This file is used by ``markitect infospace viability`` for quick
-    threshold checking.
+    threshold checking. Non-numeric values (e.g. ``type_distribution``)
+    are passed through unchanged; floats are rounded to 6 dp; ints are
+    preserved as ints so external consumers don't see ``29`` silently
+    become ``29.0`` on every round-trip.
    """
+    def _normalize(v: Any) -> Any:
+        if isinstance(v, bool):
+            return v
+        if isinstance(v, float):
+            return round(v, 6)
+        return v
+
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        yaml.safe_dump(
-            {k: round(v, 6) if isinstance(v, float) else v
-             for k, v in sorted(metrics.items())},
+            {k: _normalize(v) for k, v in sorted(metrics.items())},
            default_flow_style=False,
            sort_keys=True,
        ),
@@ -99,14 +108,20 @@ def write_metrics_file(metrics: Dict[str, float], path: Path) -> None:
    )


-def read_metrics_file(path: Path) -> Dict[str, float]:
-    """Read the latest metrics from a YAML file."""
+def read_metrics_file(path: Path) -> Dict[str, Any]:
+    """Read the latest metrics from a YAML file.
+
+    Returns all keys as written on disk, preserving types verbatim so a
+    round-trip via :func:`write_metrics_file` does not silently drop
+    structured values (e.g. ``type_distribution``) or flatten ints to
+    floats.
+    """
    if not path.is_file():
        return {}
    raw = yaml.safe_load(path.read_text(encoding="utf-8"))
    if not isinstance(raw, dict):
        return {}
-    return {k: float(v) for k, v in raw.items() if isinstance(v, (int, float))}
+    return raw


 # ── History operations ───────────────────────────────────────────────
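
To make the data loss fixed by the two hunks above concrete, here is a small sketch that compares the removed read-side filter with the new pass-through behavior. The filter expression and the `_normalize` rules are copied from the diff. The dictionary contents, and the key names `entity_count` and `coverage`, are invented for illustration (only `type_distribution` appears in the source), and YAML parsing is left out so the sketch needs no third-party package.

```python
from typing import Any, Dict


def old_filter(raw: Dict[str, Any]) -> Dict[str, float]:
    # Pre-change read path: keeps only numbers and casts them to float.
    return {k: float(v) for k, v in raw.items() if isinstance(v, (int, float))}


def normalize(v: Any) -> Any:
    # Post-change write path: round floats, pass everything else through.
    if isinstance(v, bool):
        return v
    if isinstance(v, float):
        return round(v, 6)
    return v


# Hypothetical parsed metrics, standing in for yaml.safe_load output.
parsed = {
    "entity_count": 29,
    "coverage": 0.123456789,
    "type_distribution": {"person": 12, "place": 17},
}

before = old_filter(parsed)
after = {k: normalize(v) for k, v in sorted(parsed.items())}

print(before)  # type_distribution is gone and 29 has become 29.0
print(after)   # all three keys survive; 29 is still an int
```

Under the old filter, each read-then-write cycle would have dropped the nested mapping and rewritten integer counts as floats, which matches the "silent data loss observed on every run" described in the commit message.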