feat(infospace,llm): stabilize free-tier eval workflow
Some checks failed
Test Suite / code-quality (push) Has been cancelled
Test Suite / security-scan (push) Has been cancelled
Test Suite / unit-tests (3.11) (push) Has been cancelled
Test Suite / unit-tests (3.12) (push) Has been cancelled
Test Suite / integration-tests (push) Has been cancelled
Test Suite / e2e-tests (push) Has been cancelled
Test Suite / performance-tests (push) Has been cancelled
Test Suite / test-summary (push) Has been cancelled
Five improvements that eliminate most of the agent-in-the-loop friction
observed while closing out the 988-entity WoN evaluation (C.1):
1. Gemini adapter now retries on HTTP 429 and 5xx responses with
exponential backoff
(same pattern already used by OpenRouter/OpenAI). Removes the need
for shell-level retry wrappers when hitting free-tier rate limits.
2. evaluate CLI prints the underlying error ("ERROR — HTTP 503 …")
instead of a bare "ERROR", so agents don't have to drop into Python
to diagnose transient failures.
3. --entity/--chapter now respect existing evaluation files by default
(previously only the full-collection pass did). New --force flag
opts into re-evaluation. Stops silently burning free-tier quota on
re-runs of the same slug.
4. --entity accepts hyphenated slugs (matching entity filenames) and
normalizes them to the underscore form used on disk. On a miss the
CLI suggests near matches instead of a bare "not found".
5. eval-summary --update-metrics is no longer destructive:
read_metrics_file/write_metrics_file preserve structured values
(type_distribution) and don't flatten ints to floats. Fixes a
silent data loss observed on every run.
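The retry behavior in item 1 can be sketched roughly as follows. This is a minimal illustration, not the adapter's actual code: `HTTPStatusError`, `call_with_backoff`, and the attempt, delay, and jitter defaults are all hypothetical names and values chosen for the example.

```python
import random
import time
from typing import Callable


class HTTPStatusError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status: int, message: str = "") -> None:
        super().__init__(f"HTTP {status} {message}".strip())
        self.status = status


def call_with_backoff(
    send: Callable[[], str],
    max_attempts: int = 5,        # assumption: the commit does not state a limit
    base_delay: float = 1.0,      # assumption: seconds before the first retry
    max_delay: float = 30.0,      # assumption: cap on any single wait
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call send(), retrying on 429 and 5xx with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return send()
        except HTTPStatusError as exc:
            retryable = exc.status == 429 or 500 <= exc.status < 600
            # Give up on non-retryable errors and on the final attempt.
            if not retryable or attempt == max_attempts - 1:
                raise
            # Double the wait each attempt, capped, plus up to 10% jitter.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter lets a unit test record the waits instead of actually sleeping, which keeps a retry test fast.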
Bonus: the evaluator field in written evaluation frontmatter now
falls back from run_config.model_name to the adapter's resolved model
(or the model echoed back in the API response), so rows no longer
show `evaluator: null` when --model is omitted.
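The resulting precedence can be illustrated in isolation. In this sketch `RunConfig`, `Adapter`, and `resolve_evaluator` are hypothetical stand-ins, and the model names are placeholders; only the order of the fallbacks is taken from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunConfig:
    """Stand-in for the real run config; only model_name matters here."""
    model_name: Optional[str] = None


@dataclass
class Adapter:
    """Stand-in adapter exposing its resolved model as _model."""
    _model: Optional[str] = None


def resolve_evaluator(
    run_config: Optional[RunConfig],
    adapter: object,
    response_model: Optional[str],
) -> str:
    """Pick the evaluator name: response model, then config, adapter, "unknown"."""
    default = (
        (run_config.model_name if run_config else None)
        or getattr(adapter, "_model", None)
        or "unknown"
    )
    # The model echoed back in the API response wins when present.
    return response_model or default


if __name__ == "__main__":
    # --model omitted, adapter resolved a model, response echoed nothing.
    print(resolve_evaluator(RunConfig(), Adapter("model-b"), None))  # model-b
```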
Tests: new tests/unit/llm/test_gemini.py covers retry behavior;
tests/unit/infospace/test_history.py gains a round-trip test that
pins the type_distribution / int-preservation invariants.
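A round-trip test of this kind might look like the sketch below. It is illustrative only: `write_metrics` and `read_metrics` are hypothetical stand-ins for write_metrics_file/read_metrics_file, JSON is an assumed storage format (the commit does not state the real one), and every metric value other than the 988 entity count is made up.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def write_metrics(path: Path, metrics: dict[str, Any]) -> None:
    """Hypothetical writer: serialize metrics without flattening values."""
    path.write_text(json.dumps(metrics, indent=2, sort_keys=True), encoding="utf-8")


def read_metrics(path: Path) -> dict[str, Any]:
    """Hypothetical reader: load metrics back with their original types."""
    return json.loads(path.read_text(encoding="utf-8"))


def check_round_trip() -> dict[str, Any]:
    original = {
        "entity_count": 988,                               # int must stay int
        "mean_score": 3.5,                                 # illustrative float
        "type_distribution": {"person": 3, "place": 2},    # structured value
    }
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "metrics.json"
        write_metrics(path, original)
        restored = read_metrics(path)
    assert restored == original
    # 988 == 988.0 in Python, so equality alone would miss int-to-float
    # flattening; the type has to be checked explicitly.
    assert type(restored["entity_count"]) is int
    assert isinstance(restored["type_distribution"], dict)
    return restored
```

The explicit type check is the part that pins the int-preservation invariant, since a plain equality assertion passes even when an int comes back as a float.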
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -195,12 +195,23 @@ def run_entity_evaluation(
     """
     topic = config.topic.name
     evaluations_path = output_dir or Path(config.evaluations_dir)
-    evaluator_name = (run_config.model_name if run_config else "unknown")
+    # Fall back from run_config.model_name (may be None if the CLI user did
+    # not pass --model) to the adapter's resolved model, and only then to
+    # "unknown". Keeps the evaluator field in the written frontmatter
+    # informative for later audits.
+    default_evaluator = (
+        (run_config.model_name if run_config else None)
+        or getattr(adapter, "_model", None)
+        or "unknown"
+    )
 
     def _write_and_notify(done: int, total: int, result) -> None:
         # Write file immediately on success (incremental — run is resumable)
         if result.status == "success" and result.response is not None:
             scores = parse_evaluation_response(result.response.content, dimensions)
+            # Prefer the model name the adapter actually echoed back — it
+            # reflects post-resolution fallbacks (e.g. flash → flash-lite).
+            evaluator_name = result.response.model or default_evaluator
             evaluation = EntityEvaluation(
                 entity_slug=result.key,
                 evaluator=evaluator_name,