llm-connect/contracts/functional/quality-ledger.md
tegwick c4ad4bb9f2
Add adaptive cost-quality routing primitives
2026-05-17 21:32:27 +02:00


Contract: QualityObservation and QualityLedger

layer: Functional
maturity: Beta
module: llm_connect.quality
since: WP-0004

Purpose

Record observed quality, cost, latency, and token outcomes for a logical task type so consumers can build adaptive routing policy without putting consumer-specific thresholds into llm-connect.

Public surface

@dataclass(frozen=True)
class QualityObservation:
    task_type: str
    adapter_id: str
    model_id: str
    cost_usd: float
    quality_score: float
    latency_ms: float
    tokens_in: int
    tokens_out: int
    baseline_adapter_id: str | None = None
    recorded_at: datetime = field(default_factory=...)
    tags: dict[str, Any] = field(default_factory=dict)

    @property
    def total_tokens(self) -> int: ...
    def to_dict(self) -> dict[str, Any]: ...
    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "QualityObservation": ...

class QualityLedger:
    def __init__(self, path: str | Path): ...
    @property
    def path(self) -> Path: ...
    def append(self, observation: QualityObservation) -> None: ...
    def read_all(self) -> list[QualityObservation]: ...
    def malformed_count(self) -> int: ...
    def by_task_type(self, task_type: str) -> list[QualityObservation]: ...
    def recent(...) -> list[QualityObservation]: ...
    def mean_quality(...) -> float | None: ...
    def prune_before(self, timestamp: datetime) -> int: ...

def is_stale(observation: QualityObservation, max_age: timedelta, *, now: datetime | None = None) -> bool: ...
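
The is_stale() signature suggests a plain age comparison. The following is a minimal sketch of that idea, not the library's implementation: is_stale_sketch is a hypothetical stand-in that takes the timestamp directly instead of a QualityObservation, and it assumes an observation is stale when its age strictly exceeds max_age (the contract does not say whether the boundary is inclusive).

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone


def _as_utc(value: datetime) -> datetime:
    """Interpret naive datetimes as UTC and convert aware ones to UTC."""
    if value.tzinfo is None:
        return value.replace(tzinfo=timezone.utc)
    return value.astimezone(timezone.utc)


def is_stale_sketch(
    recorded_at: datetime,
    max_age: timedelta,
    *,
    now: datetime | None = None,
) -> bool:
    """Hypothetical stand-in for is_stale(); not the llm_connect code."""
    if max_age < timedelta(0):
        # Matches the error contract: a negative max_age raises ValueError.
        raise ValueError("max_age must be non-negative")
    current = _as_utc(now) if now is not None else datetime.now(timezone.utc)
    # Assumption: strictly older than max_age counts as stale.
    return current - _as_utc(recorded_at) > max_age
```

Passing now explicitly, as the real signature allows, keeps staleness checks deterministic in tests.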

Invariants

  1. quality_score is normalised to the range 0.0 to 1.0 inclusive, where 1.0 means the candidate fully meets the grader's quality bar and 0.0 means complete failure for that grader.
  2. task_type, adapter_id, and model_id must be non-empty strings.
  3. cost_usd, latency_ms, tokens_in, and tokens_out are non-negative.
  4. recorded_at is normalised to UTC. Naive datetimes are interpreted as UTC.
  5. Ledger records are JSON Lines. Each line is one QualityObservation.to_dict().
  6. QualityLedger.append() holds a process-local lock and an advisory file lock for the duration of each write.
  7. Read/query helpers skip malformed lines instead of failing the whole ledger. malformed_count() exposes how many lines were skipped.
  8. prune_before() removes only valid observations older than the cutoff. Malformed lines are preserved.
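
Invariants 5, 7, and 8 can be illustrated with a small JSON Lines reader. This is a sketch under stated assumptions rather than the QualityLedger code: parse_line, read_valid, and prune_before_sketch are hypothetical helpers, records are treated as plain dicts, recorded_at is assumed to be stored as an ISO 8601 string, and blank lines are assumed to be neither valid nor counted as malformed.

```python
from __future__ import annotations

import json
from datetime import datetime, timezone
from pathlib import Path


def parse_line(line: str) -> dict | None:
    """Return the decoded record, or None when the line is malformed."""
    try:
        record = json.loads(line)
        stamp = datetime.fromisoformat(record["recorded_at"])
    except (ValueError, KeyError, TypeError):
        return None
    if stamp.tzinfo is None:
        stamp = stamp.replace(tzinfo=timezone.utc)  # naive means UTC
    record["recorded_at"] = stamp.astimezone(timezone.utc)
    return record


def read_valid(path: Path) -> tuple[list[dict], int]:
    """Return (valid records, number of malformed lines skipped)."""
    valid: list[dict] = []
    malformed = 0
    for line in path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # assumption: blank lines are ignored
        record = parse_line(line)
        if record is None:
            malformed += 1
        else:
            valid.append(record)
    return valid, malformed


def prune_before_sketch(path: Path, cutoff: datetime) -> int:
    """Drop valid records older than cutoff; keep every other line verbatim.

    cutoff is assumed to be timezone-aware.
    """
    kept: list[str] = []
    removed = 0
    for line in path.read_text(encoding="utf-8").splitlines():
        record = parse_line(line)
        if record is not None and record["recorded_at"] < cutoff:
            removed += 1
        else:
            kept.append(line)  # malformed lines are preserved
    path.write_text("".join(f"{line}\n" for line in kept), encoding="utf-8")
    return removed
```

The sketch rewrites the file without any locking. A real implementation would presumably take the same locks that append() uses.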

Error contract

| Condition | Exception |
|---|---|
| Invalid observation field | ValueError |
| Invalid datetime field | TypeError or ValueError |
| Negative recent limit | ValueError |
| mean_quality(min_observations <= 0) | ValueError |
| is_stale(max_age < 0) | ValueError |
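
The field checks behind the ValueError rows might look roughly like the sketch below. validate_fields and mean_quality_sketch are hypothetical helpers written for illustration; the parameters of the real mean_quality() are elided in the public surface, so the scores argument is an assumption, as is returning None when fewer than min_observations scores are available.

```python
from __future__ import annotations

from statistics import fmean


def validate_fields(
    *,
    task_type: str,
    adapter_id: str,
    model_id: str,
    cost_usd: float,
    quality_score: float,
    latency_ms: float,
    tokens_in: int,
    tokens_out: int,
) -> None:
    """Raise ValueError when a field breaks invariants 1 to 3."""
    identifiers = (
        ("task_type", task_type),
        ("adapter_id", adapter_id),
        ("model_id", model_id),
    )
    for name, value in identifiers:
        if not isinstance(value, str) or not value:
            raise ValueError(f"{name} must be a non-empty string")
    # The chained comparison is False for NaN, so NaN is rejected as well.
    if not 0.0 <= quality_score <= 1.0:
        raise ValueError("quality_score must be within 0.0..1.0")
    measurements = (
        ("cost_usd", cost_usd),
        ("latency_ms", latency_ms),
        ("tokens_in", tokens_in),
        ("tokens_out", tokens_out),
    )
    for name, number in measurements:
        if number < 0:
            raise ValueError(f"{name} must be non-negative")


def mean_quality_sketch(
    scores: list[float], *, min_observations: int = 1
) -> float | None:
    """Hypothetical stand-in for mean_quality(); the signature is assumed."""
    if min_observations <= 0:
        raise ValueError("min_observations must be positive")
    if len(scores) < min_observations:
        return None  # assumption: too few observations yields None
    return fmean(scores)
```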

Known consumers

  • infospace-bench is the first intended consumer. It is expected to provide task taxonomy, thresholds, and baseline choice.
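
To show how thresholds and the baseline choice can live entirely on the consumer side, here is a hedged sketch of a routing policy built on ledger observations. Nothing in it is taken from infospace-bench or llm-connect: Obs is a hypothetical stand-in for the subset of QualityObservation fields used, and pick_adapter with its min_quality and min_observations parameters is invented for illustration.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass
from statistics import fmean


@dataclass(frozen=True)
class Obs:
    """Hypothetical stand-in for the QualityObservation fields used here."""

    task_type: str
    adapter_id: str
    cost_usd: float
    quality_score: float


def pick_adapter(
    observations: list[Obs],
    task_type: str,
    *,
    min_quality: float,
    baseline_adapter_id: str,
    min_observations: int = 3,
) -> str:
    """Pick the cheapest adapter whose mean quality meets the threshold."""
    by_adapter: dict[str, list[Obs]] = defaultdict(list)
    for obs in observations:
        if obs.task_type == task_type:
            by_adapter[obs.adapter_id].append(obs)

    candidates: list[tuple[float, str]] = []
    for adapter_id, rows in by_adapter.items():
        if len(rows) < min_observations:
            continue  # not enough evidence for this adapter yet
        if fmean([row.quality_score for row in rows]) >= min_quality:
            mean_cost = fmean([row.cost_usd for row in rows])
            candidates.append((mean_cost, adapter_id))

    if not candidates:
        return baseline_adapter_id  # fall back to the consumer's baseline
    return min(candidates)[1]
```

The threshold, the evidence floor, and the fallback are all arguments supplied by the caller, which is the separation the contract's purpose statement describes.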

Notes

The ledger intentionally stores only observation metadata in this slice. Callers that need prompt or response digests can store them in tags, for example under a prompt_fingerprint key.
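
A caller could derive such a digest as follows. This is only a sketch: the contract names prompt_fingerprint as an example tag but does not prescribe a hash, so SHA-256 over the UTF-8 prompt text and the helper name are assumptions.

```python
import hashlib
import json


def prompt_fingerprint(prompt: str) -> str:
    """Return a hex SHA-256 digest of the prompt (the algorithm is assumed)."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


# The tags dict stays JSON-serialisable, so it fits a JSON Lines record.
tags = {"prompt_fingerprint": prompt_fingerprint("Summarise the attached report.")}
encoded_tags = json.dumps(tags)
```

Storing a digest rather than the prompt keeps ledger lines small and avoids persisting prompt text.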