feat(infospace,llm): stabilize free-tier eval workflow
Some checks failed
Test Suite / code-quality (push) Has been cancelled
Test Suite / security-scan (push) Has been cancelled
Test Suite / unit-tests (3.11) (push) Has been cancelled
Test Suite / unit-tests (3.12) (push) Has been cancelled
Test Suite / integration-tests (push) Has been cancelled
Test Suite / e2e-tests (push) Has been cancelled
Test Suite / performance-tests (push) Has been cancelled
Test Suite / test-summary (push) Has been cancelled
Five improvements that eliminate most of the agent-in-the-loop friction
observed while closing out the 988-entity WoN evaluation (C.1):
1. Gemini adapter now retries on 429 + 5xx with exponential backoff
(same pattern already used by OpenRouter/OpenAI). Removes the need
for shell-level retry wrappers when hitting free-tier rate limits.
2. evaluate CLI prints the underlying error ("ERROR — HTTP 503 …")
instead of a bare "ERROR", so agents don't have to drop into Python
to diagnose transient failures.
3. --entity/--chapter now respect existing evaluation files by default
(previously only the full-collection pass did). New --force flag
opts into re-evaluation. Stops silently burning free-tier quota on
re-runs of the same slug.
4. --entity accepts hyphenated slugs (matching entity filenames) and
normalizes them to the underscore form used on disk. On a miss the
CLI suggests near matches instead of a bare "not found".
5. eval-summary --update-metrics is no longer destructive:
read_metrics_file/write_metrics_file preserve structured values
(type_distribution) and don't flatten ints to floats. Fixes a
silent data loss observed on every run.
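
The slug handling in item 4 could be sketched roughly as follows. This is a minimal illustration, not the CLI's actual code: `resolve_slug` and the sample slugs are hypothetical, and using `difflib` for near matches is an assumption.

```python
import difflib
from typing import List, Optional, Tuple


def resolve_slug(requested: str, known: List[str]) -> Tuple[Optional[str], List[str]]:
    """Return (match, suggestions) for a user-supplied entity slug."""
    # Hyphenated input is mapped to the underscore form used on disk.
    normalized = requested.strip().replace("-", "_")
    if normalized in known:
        return normalized, []
    # On a miss, offer up to three close spellings instead of a bare error.
    return None, difflib.get_close_matches(normalized, known, n=3, cutoff=0.6)


# Hypothetical slugs, for illustration only.
known_slugs = ["adam_smith", "division_of_labour", "invisible_hand"]
hit = resolve_slug("adam-smith", known_slugs)
miss = resolve_slug("invisble-hand", known_slugs)
print(hit)
print(miss)
```

Returning the suggestions alongside the miss keeps the lookup free of printing, so the CLI layer can decide how to phrase the "did you mean" message.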
Bonus: the evaluator field in written evaluation frontmatter now
falls back from run_config.model_name to the adapter's resolved model
(or the model echoed back in the API response), so rows no longer
show `evaluator: null` when --model is omitted.
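
A rough sketch of that fallback chain, under stated assumptions: `resolve_evaluator` and its parameters are hypothetical names, the response is assumed to be a mapping with a `model` key, and trying the adapter's model before the response echo is a guess at the order.

```python
from typing import Any, Mapping, Optional


def resolve_evaluator(
    configured: Optional[str],
    adapter_model: Optional[str],
    response: Mapping[str, Any],
) -> Optional[str]:
    """Pick the first non-empty model name from the fallback chain."""
    # Assumed order: explicit config, then the adapter's resolved model,
    # then whatever model name the API echoed back.
    for candidate in (configured, adapter_model, response.get("model")):
        if candidate:
            return candidate
    return None


# Placeholder model names, for illustration only.
print(resolve_evaluator(None, "adapter-default-model", {}))
print(resolve_evaluator(None, None, {"model": "echoed-model"}))
```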
Tests: new tests/unit/llm/test_gemini.py covers retry behavior;
tests/unit/infospace/test_history.py gains a round-trip test that
pins the type_distribution / int-preservation invariants.
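
For orientation, the retry behavior from item 1 might look something like the sketch below. It is not the Gemini adapter's code: `call_with_backoff`, its defaults, and the jitter term are assumptions, and `send` stands in for whatever performs the HTTP request.

```python
import random
import time
from typing import Callable, Tuple

# Statuses treated as transient: 429 plus the whole 5xx range.
RETRYABLE = {429} | set(range(500, 600))


def call_with_backoff(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() until it returns a non-retryable status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        is_last = attempt == max_attempts - 1
        if status not in RETRYABLE or is_last:
            return status, body
        # Delay doubles each attempt (1s, 2s, 4s, ...) plus a little jitter.
        sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    raise AssertionError("unreachable")


# Simulated server: two transient failures, then success.
responses = iter([(429, "rate limited"), (503, "unavailable"), (200, "ok")])
delays = []
result = call_with_backoff(lambda: next(responses), sleep=delays.append)
print(result, len(delays))
```

Injecting `sleep` lets a unit test record the delays instead of waiting, which is one way retry behavior can be pinned without slowing the suite.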
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -81,17 +81,26 @@ def snapshot_from_checks(
 # ── Metrics file I/O ────────────────────────────────────────────────
 
 
-def write_metrics_file(metrics: Dict[str, float], path: Path) -> None:
+def write_metrics_file(metrics: Dict[str, Any], path: Path) -> None:
     """Write the latest metrics to a simple YAML file.
 
     This file is used by ``markitect infospace viability`` for quick
-    threshold checking.
+    threshold checking. Non-numeric values (e.g. ``type_distribution``)
+    are passed through unchanged; floats are rounded to 6 dp; ints are
+    preserved as ints so external consumers don't see ``29`` silently
+    become ``29.0`` on every round-trip.
     """
+    def _normalize(v: Any) -> Any:
+        if isinstance(v, bool):
+            return v
+        if isinstance(v, float):
+            return round(v, 6)
+        return v
+
     path.parent.mkdir(parents=True, exist_ok=True)
     path.write_text(
         yaml.safe_dump(
-            {k: round(v, 6) if isinstance(v, float) else v
-             for k, v in sorted(metrics.items())},
+            {k: _normalize(v) for k, v in sorted(metrics.items())},
             default_flow_style=False,
             sort_keys=True,
         ),
@@ -99,14 +108,20 @@ def write_metrics_file(metrics: Dict[str, float], path: Path) -> None:
     )
 
 
-def read_metrics_file(path: Path) -> Dict[str, float]:
-    """Read the latest metrics from a YAML file."""
+def read_metrics_file(path: Path) -> Dict[str, Any]:
+    """Read the latest metrics from a YAML file.
+
+    Returns all keys as written on disk, preserving types verbatim so a
+    round-trip via :func:`write_metrics_file` does not silently drop
+    structured values (e.g. ``type_distribution``) or flatten ints to
+    floats.
+    """
     if not path.is_file():
         return {}
     raw = yaml.safe_load(path.read_text(encoding="utf-8"))
     if not isinstance(raw, dict):
         return {}
-    return {k: float(v) for k, v in raw.items() if isinstance(v, (int, float))}
+    return raw
 
 
 # ── History operations ───────────────────────────────────────────────
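
To make the data loss concrete, the sketch below applies the removed read-side filter and the new write-side normalization to an already-parsed mapping. The YAML step is left out so the example runs standalone; `old_read_filter`, `normalize`, and the metric values are illustrative stand-ins, not names or data from the repository.

```python
from typing import Any, Dict


def old_read_filter(raw: Dict[str, Any]) -> Dict[str, float]:
    # The filter removed by this commit: keeps only numbers, coerced to float.
    return {k: float(v) for k, v in raw.items() if isinstance(v, (int, float))}


def normalize(v: Any) -> Any:
    # Mirrors _normalize from the diff: only floats are rounded.
    if isinstance(v, bool):
        return v
    if isinstance(v, float):
        return round(v, 6)
    return v


# Hypothetical parsed metrics, for illustration only.
parsed = {
    "entity_count": 29,
    "coverage": 0.123456789,
    "type_distribution": {"concept": 20, "person": 9},
}

lossy = old_read_filter(parsed)
kept = {k: normalize(v) for k, v in parsed.items()}

print(lossy)  # type_distribution is gone and 29 has become 29.0
print(kept)   # ints and nested values survive; the float is rounded to 6 dp
```

Because `bool` is a subclass of `int` in Python, the old filter would also have turned a `True` flag into `1.0`; returning the parsed mapping unchanged avoids that as well.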