Contract: QualityObservation and QualityLedger
layer: Functional
maturity: Beta
module: llm_connect.quality
since: WP-0004
Purpose
Record observed quality, cost, latency, and token outcomes for a logical task type so consumers can build adaptive routing policy without putting consumer-specific thresholds into llm-connect.
Public surface

    @dataclass(frozen=True)
    class QualityObservation:
        task_type: str
        adapter_id: str
        model_id: str
        cost_usd: float
        quality_score: float
        latency_ms: float
        tokens_in: int
        tokens_out: int
        baseline_adapter_id: str | None = None
        recorded_at: datetime = field(default_factory=...)
        tags: dict[str, Any] = field(default_factory=dict)

        @property
        def total_tokens(self) -> int: ...

        def to_dict(self) -> dict[str, Any]: ...

        @classmethod
        def from_dict(cls, data: dict[str, Any]) -> "QualityObservation": ...

    class QualityLedger:
        def __init__(self, path: str | Path): ...

        @property
        def path(self) -> Path: ...

        def append(self, observation: QualityObservation) -> None: ...
        def read_all(self) -> list[QualityObservation]: ...
        def malformed_count(self) -> int: ...
        def by_task_type(self, task_type: str) -> list[QualityObservation]: ...
        def recent(...) -> list[QualityObservation]: ...
        def mean_quality(...) -> float | None: ...
        def prune_before(self, timestamp: datetime) -> int: ...

    def is_stale(observation: QualityObservation, max_age: timedelta, *, now: datetime | None = None) -> bool: ...
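
To make the shape of the observation type concrete, here is a minimal stand-in sketch, not the real `llm_connect.quality` implementation. `ObservationSketch` and `_to_utc` are hypothetical names. The sketch assumes that `total_tokens` is `tokens_in + tokens_out` and that `recorded_at` is serialised as an ISO 8601 string; neither detail is stated by this contract.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any


def _to_utc(value: datetime) -> datetime:
    """Treat naive datetimes as UTC; convert aware ones to UTC."""
    if value.tzinfo is None:
        return value.replace(tzinfo=timezone.utc)
    return value.astimezone(timezone.utc)


@dataclass(frozen=True)
class ObservationSketch:
    """Hypothetical stand-in for QualityObservation (illustration only)."""

    task_type: str
    adapter_id: str
    model_id: str
    cost_usd: float
    quality_score: float
    latency_ms: float
    tokens_in: int
    tokens_out: int
    baseline_adapter_id: str | None = None
    recorded_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    tags: dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # The dataclass is frozen, so the normalised value is stored
        # through object.__setattr__.
        object.__setattr__(self, "recorded_at", _to_utc(self.recorded_at))

    @property
    def total_tokens(self) -> int:
        # Assumption: total is simply input plus output tokens.
        return self.tokens_in + self.tokens_out

    def to_dict(self) -> dict[str, Any]:
        data = asdict(self)
        # Assumption: timestamps are written as ISO 8601 strings.
        data["recorded_at"] = self.recorded_at.isoformat()
        return data

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "ObservationSketch":
        fields = dict(data)
        fields["recorded_at"] = datetime.fromisoformat(fields["recorded_at"])
        return cls(**fields)
```

Under these assumptions, `ObservationSketch.from_dict(obs.to_dict())` compares equal to `obs`, which is the round-trip property a JSON Lines ledger relies on.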
Invariants
- `quality_score` is a normalised `0.0..1.0` score where `1.0` means the candidate fully meets the grader's quality bar and `0.0` means complete failure for that grader.
- `task_type`, `adapter_id`, and `model_id` must be non-empty strings.
- `cost_usd`, `latency_ms`, `tokens_in`, and `tokens_out` are non-negative.
- `recorded_at` is normalised to UTC. Naive datetimes are interpreted as UTC.
- Ledger records are JSON Lines. Each line is the JSON encoding of one `QualityObservation.to_dict()` result.
- `QualityLedger.append()` takes a process-local lock plus an advisory file lock around each write.
- Read/query helpers skip malformed lines instead of failing the whole ledger. `malformed_count()` exposes how many lines were skipped.
- `prune_before()` removes only valid observations older than the cutoff. Malformed lines are preserved.
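
The read and prune invariants above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the ledger's actual code: `_parse_line`, `read_valid`, and the free-function `prune_before` are hypothetical helpers, a line counts as valid here only if it is JSON with an ISO 8601 `recorded_at` field, and the locking performed by `QualityLedger.append()` is omitted.

```python
from __future__ import annotations

import json
from datetime import datetime, timezone
from pathlib import Path


def _parse_line(line: str) -> dict | None:
    """Return the decoded record, or None when the line is malformed."""
    try:
        record = json.loads(line)
        # Assumption: the timestamp is an ISO 8601 string under "recorded_at".
        stamp = datetime.fromisoformat(record["recorded_at"])
    except (ValueError, KeyError, TypeError):
        return None
    if stamp.tzinfo is None:
        stamp = stamp.replace(tzinfo=timezone.utc)  # naive means UTC
    record["recorded_at"] = stamp.astimezone(timezone.utc)
    return record


def read_valid(path: Path) -> tuple[list[dict], int]:
    """Return (valid records, number of malformed lines skipped)."""
    records, malformed = [], 0
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # assumption: blank lines are ignored, not counted
            record = _parse_line(line)
            if record is None:
                malformed += 1
            else:
                records.append(record)
    return records, malformed


def prune_before(path: Path, cutoff: datetime) -> int:
    """Drop valid records older than a timezone-aware cutoff.

    Malformed lines are written back untouched. Returns the number removed.
    """
    kept, removed = [], 0
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            record = _parse_line(line) if line.strip() else None
            if record is not None and record["recorded_at"] < cutoff:
                removed += 1
            else:
                kept.append(line)
    path.write_text("".join(kept), encoding="utf-8")
    return removed
```

Keeping malformed lines during a prune means a corrupted write is never silently destroyed; it stays visible through the skipped-line count until someone inspects it.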
Error contract
| Condition | Exception |
|---|---|
| Invalid observation field | `ValueError` |
| Invalid datetime field | `TypeError` or `ValueError` |
| Negative `recent` limit | `ValueError` |
| `mean_quality(min_observations <= 0)` | `ValueError` |
| `is_stale(max_age < 0)` | `ValueError` |
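
As an illustration of the last two rows, the following sketch shows argument checks that raise `ValueError` before any work is done. `is_stale_sketch` and `mean_quality_sketch` are hypothetical stand-ins with simplified signatures. The strict `>` comparison for staleness, the default `min_observations=1`, and returning `None` when too few observations exist are assumptions, the last one inferred only from the `float | None` return type.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone


def is_stale_sketch(
    recorded_at: datetime,
    max_age: timedelta,
    *,
    now: datetime | None = None,
) -> bool:
    """Return True when the timestamp is older than max_age."""
    if max_age < timedelta(0):
        raise ValueError("max_age must be non-negative")
    current = now if now is not None else datetime.now(timezone.utc)
    # Assumption: an observation exactly max_age old is not yet stale.
    return current - recorded_at > max_age


def mean_quality_sketch(
    scores: list[float],
    *,
    min_observations: int = 1,
) -> float | None:
    """Return the mean score, or None when there are too few observations."""
    if min_observations <= 0:
        raise ValueError("min_observations must be positive")
    if len(scores) < min_observations:
        return None
    return sum(scores) / len(scores)
```

Injecting `now` as a keyword argument, as the public `is_stale` signature does, keeps staleness checks deterministic in tests.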
Known consumers
`infospace-bench` is the first intended consumer. It is expected to provide the task taxonomy, thresholds, and baseline choice.
Notes
The ledger intentionally stores only observation metadata in this slice. Callers
that need prompt or response digests can place those in `tags`, for example
`prompt_fingerprint`.
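
One way a caller might build such a digest is sketched below. The helper function and the choice of SHA-256 are assumptions made for illustration; the contract only names `prompt_fingerprint` as an example tag key and does not prescribe a hash.

```python
import hashlib


def prompt_fingerprint(prompt: str) -> str:
    """Return a stable hex digest of the prompt text (SHA-256 assumed)."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


# The digest, not the prompt itself, goes into the observation's tags.
tags = {"prompt_fingerprint": prompt_fingerprint("Summarise the following report.")}
```

Storing a digest lets consumers group observations by identical prompts without the ledger ever holding prompt text.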