Files
markitect-main/markitect/infospace/evaluate.py
tegwick c0615c2d50
Some checks failed
Test Suite / code-quality (push) Has been cancelled
Test Suite / security-scan (push) Has been cancelled
Test Suite / unit-tests (3.11) (push) Has been cancelled
Test Suite / unit-tests (3.12) (push) Has been cancelled
Test Suite / integration-tests (push) Has been cancelled
Test Suite / e2e-tests (push) Has been cancelled
Test Suite / performance-tests (push) Has been cancelled
Test Suite / test-summary (push) Has been cancelled
feat(infospace,llm): stabilize free-tier eval workflow
Five improvements that eliminate most of the agent-in-the-loop friction
observed while closing out the 988-entity WoN evaluation (C.1):

1. Gemini adapter now retries on 429 + 5xx with exponential backoff
   (same pattern already used by OpenRouter/OpenAI). Removes the need
   for shell-level retry wrappers when hitting free-tier rate limits.

2. evaluate CLI prints the underlying error ("ERROR — HTTP 503 …")
   instead of a bare "ERROR", so agents don't have to drop into Python
   to diagnose transient failures.

3. --entity/--chapter now respect existing evaluation files by default
   (previously only the full-collection pass did). New --force flag
   opts into re-evaluation. Stops silently burning free-tier quota on
   re-runs of the same slug.

4. --entity accepts hyphenated slugs (matching entity filenames) and
   normalizes them to the underscore form used on disk. On a miss the
   CLI suggests near matches instead of a bare "not found".

5. eval-summary --update-metrics is no longer destructive:
   read_metrics_file/write_metrics_file preserve structured values
   (type_distribution) and don't flatten ints to floats. Fixes a
   silent data loss observed on every run.

Bonus: the evaluator field in written evaluation frontmatter now
falls back from run_config.model_name to the adapter's resolved model
(or the model echoed back in the API response), so rows no longer
show `evaluator: null` when --model is omitted.

Tests: new tests/unit/llm/test_gemini.py covers retry behavior;
tests/unit/infospace/test_history.py gains a round-trip test that
pins the type_distribution / int-preservation invariants.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-22 00:51:00 +02:00

244 lines
8.2 KiB
Python
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
"""
Per-entity evaluation pipeline.
Builds prompts from entity metadata and delegates LLM evaluation to
the :class:`BatchEvaluator`. Writes structured results to the
evaluations directory.
"""
from __future__ import annotations
import hashlib
from datetime import datetime
from pathlib import Path
from typing import Callable, Dict, List, Optional
from markitect.infospace.config import InfospaceConfig
from markitect.infospace.evaluation import EntityEvaluation, ScoreEntry
from markitect.infospace.evaluation_io import write_entity_evaluation
from markitect.infospace.models import EntityMeta
from markitect.prompts.execution.batch import BatchEvaluator, BatchItem, BatchSummary
from markitect.prompts.execution.llm_adapter import LLMAdapter
from markitect.prompts.execution.models import RunConfig
_DEFAULT_DIMENSIONS = [
"definition_precision",
"source_grounding",
"domain_placement",
"vsm_relevance",
"explanatory_value",
]
_PROMPT_TEMPLATE = """\
You are evaluating an entity from an infospace about "{topic}".
## Entity: {title}
**Slug:** {slug}
**Domain:** {domain}
**Source chapter:** {source_chapter}
### Definition
{definition}
### Context
{context}
## Background
This infospace maps concepts from the source corpus to Stafford Beer's
Viable System Model (VSM). The VSM has five systems: S1 (primary operations),
S2 (coordination/anti-oscillation), S3 (internal regulation/audit),
S4 (intelligence/environmental adaptation), S5 (identity/policy). Use this
to assess whether the entity has a natural VSM home.
## Dimensions
- **definition_precision**: Is the definition precise and non-circular? \
Does it capture a distinct concept rather than a vague umbrella term?
- **source_grounding**: Is this entity grounded in the actual source text, \
or does it introduce concepts the source does not clearly state?
- **domain_placement**: Is the economic/thematic domain assignment correct? \
Does the entity belong in a different conceptual category?
- **vsm_relevance**: Does this entity map naturally to one or more VSM \
systems (S1S5), or is it VSM-neutral/too abstract to place?
- **explanatory_value**: Does this entity add genuine explanatory power — \
illuminating a mechanism or structural relation — or does it merely name \
a surface phenomenon?
## Instructions
Rate each dimension 15 (1 = poor, 5 = excellent). Provide a brief
rationale (12 sentences) for each score.
## Output format
DIMENSION: <name>
SCORE: <1-5>
RATIONALE: <explanation>
Repeat for each dimension.
"""
def build_evaluation_prompt(
entity: EntityMeta,
topic: str,
dimensions: Optional[List[str]] = None,
) -> str:
"""Build an evaluation prompt for a single entity."""
dims = dimensions or _DEFAULT_DIMENSIONS
dims_list = "\n".join(f"- {d}" for d in dims)
return _PROMPT_TEMPLATE.format(
topic=topic,
title=entity.title,
slug=entity.slug,
domain=entity.domain or "(unspecified)",
source_chapter=entity.source_chapter or "(unspecified)",
definition=entity.definition or "(no definition)",
context=entity.context or "(no context)",
dimensions_list=dims_list,
)
def content_digest(entity: EntityMeta) -> str:
"""Compute a content digest for incremental evaluation."""
content = f"{entity.slug}:{entity.definition}:{entity.context}:{entity.domain}"
return hashlib.sha256(content.encode()).hexdigest()[:16]
def parse_evaluation_response(
response_text: str,
dimensions: Optional[List[str]] = None,
) -> List[ScoreEntry]:
"""Parse structured dimension scores from LLM response text.
Expects blocks of::
DIMENSION: <name>
SCORE: <1-5>
RATIONALE: <text>
"""
dims = dimensions or _DEFAULT_DIMENSIONS
scores: List[ScoreEntry] = []
current_dim = None
current_score = None
current_rationale = ""
for line in response_text.splitlines():
stripped = line.strip()
if stripped.upper().startswith("DIMENSION:"):
# Flush previous
if current_dim is not None and current_score is not None:
scores.append(ScoreEntry(
name=current_dim,
value=current_score,
max_value=5.0,
rationale=current_rationale.strip(),
))
current_dim = stripped.split(":", 1)[1].strip()
current_score = None
current_rationale = ""
elif stripped.upper().startswith("SCORE:"):
try:
current_score = float(stripped.split(":", 1)[1].strip())
except ValueError:
current_score = None
elif stripped.upper().startswith("RATIONALE:"):
current_rationale = stripped.split(":", 1)[1].strip()
elif current_dim is not None and current_score is not None:
# Continuation of rationale
if stripped:
current_rationale += " " + stripped
# Flush last
if current_dim is not None and current_score is not None:
scores.append(ScoreEntry(
name=current_dim,
value=current_score,
max_value=5.0,
rationale=current_rationale.strip(),
))
return scores
def run_entity_evaluation(
config: InfospaceConfig,
entities: List[EntityMeta],
adapter: LLMAdapter,
run_config: Optional[RunConfig] = None,
output_dir: Optional[Path] = None,
previous_digests: Optional[Dict[str, str]] = None,
progress_callback: Optional[Callable] = None,
dimensions: Optional[List[str]] = None,
) -> BatchSummary:
"""Run per-entity evaluation using the batch evaluator.
Evaluation files are written **incrementally** after each successful
result, so a long run is resumable and safe to interrupt.
Args:
config: The infospace configuration.
entities: Entities to evaluate.
adapter: LLM adapter for evaluation.
run_config: LLM execution configuration.
output_dir: Where to write evaluation results. Defaults to
``config.evaluations_dir`` relative to CWD.
previous_digests: ``{slug: digest}`` for incremental skip.
progress_callback: Called after each item.
dimensions: Custom evaluation dimensions.
Returns:
A :class:`BatchSummary` with per-entity results.
"""
topic = config.topic.name
evaluations_path = output_dir or Path(config.evaluations_dir)
# Fall back from run_config.model_name (may be None if the CLI user did
# not pass --model) to the adapter's resolved model, and only then to
# "unknown". Keeps the evaluator field in the written frontmatter
# informative for later audits.
default_evaluator = (
(run_config.model_name if run_config else None)
or getattr(adapter, "_model", None)
or "unknown"
)
def _write_and_notify(done: int, total: int, result) -> None:
# Write file immediately on success (incremental — run is resumable)
if result.status == "success" and result.response is not None:
scores = parse_evaluation_response(result.response.content, dimensions)
# Prefer the model name the adapter actually echoed back — it
# reflects post-resolution fallbacks (e.g. flash → flash-lite).
evaluator_name = result.response.model or default_evaluator
evaluation = EntityEvaluation(
entity_slug=result.key,
evaluator=evaluator_name,
scores=scores,
evaluated_at=datetime.utcnow(),
)
eval_path = evaluations_path / f"{result.key}.md"
write_entity_evaluation(evaluation, eval_path)
if progress_callback is not None:
progress_callback(done, total, result)
items = [
BatchItem(
key=entity.slug,
prompt=build_evaluation_prompt(entity, topic, dimensions),
content_digest=content_digest(entity),
metadata={"source_path": entity.source_path},
)
for entity in entities
]
evaluator = BatchEvaluator(
adapter=adapter,
config=run_config,
progress_callback=_write_and_notify,
previous_digests=previous_digests,
)
return evaluator.evaluate(items)