# Contract: QualityObservation and QualityLedger

**layer:** Functional
**maturity:** Beta
**module:** `llm_connect.quality`
**since:** WP-0004

## Purpose

Record observed quality, cost, latency, and token outcomes for a logical task type so consumers can build adaptive routing policy without putting consumer-specific thresholds into llm-connect.

## Public surface

```python
@dataclass(frozen=True)
class QualityObservation:
    task_type: str
    adapter_id: str
    model_id: str
    cost_usd: float
    quality_score: float
    latency_ms: float
    tokens_in: int
    tokens_out: int
    baseline_adapter_id: str | None = None
    recorded_at: datetime = field(default_factory=...)
    tags: dict[str, Any] = field(default_factory=dict)

    @property
    def total_tokens(self) -> int: ...
    def to_dict(self) -> dict[str, Any]: ...
    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "QualityObservation": ...


class QualityLedger:
    def __init__(self, path: str | Path): ...
    @property
    def path(self) -> Path: ...
    def append(self, observation: QualityObservation) -> None: ...
    def read_all(self) -> list[QualityObservation]: ...
    def malformed_count(self) -> int: ...
    def by_task_type(self, task_type: str) -> list[QualityObservation]: ...
    def recent(...) -> list[QualityObservation]: ...
    def mean_quality(...) -> float | None: ...
    def prune_before(self, timestamp: datetime) -> int: ...


def is_stale(observation: QualityObservation, max_age: timedelta, *, now: datetime | None = None) -> bool: ...
```

## Invariants

1. `quality_score` is a normalised `0.0..1.0` score where `1.0` means the candidate fully meets the grader's quality bar and `0.0` means complete failure for that grader.
2. `task_type`, `adapter_id`, and `model_id` must be non-empty strings.
3. `cost_usd`, `latency_ms`, `tokens_in`, and `tokens_out` are non-negative.
4. `recorded_at` is normalised to UTC. Naive datetimes are interpreted as UTC.
5. Ledger records are JSON Lines. Each line is one `QualityObservation.to_dict()`.
6. `QualityLedger.append()` performs a process-local lock plus an advisory file lock around each write.
7. Read/query helpers skip malformed lines instead of failing the whole ledger. `malformed_count()` exposes how many lines were skipped.
8. `prune_before()` removes only valid observations older than the cutoff. Malformed lines are preserved.

## Error contract

| Condition | Exception |
|-----------|-----------|
| Invalid observation field | `ValueError` |
| Invalid datetime field | `TypeError` or `ValueError` |
| Negative recent limit | `ValueError` |
| `mean_quality(min_observations <= 0)` | `ValueError` |
| `is_stale(max_age < 0)` | `ValueError` |

## Known consumers

- `infospace-bench` is the first intended consumer. It is expected to provide task taxonomy, thresholds, and baseline choice.

## Notes

The ledger intentionally stores only observation metadata in this slice. Callers that need prompt or response digests can place those in `tags`, for example `prompt_fingerprint`.
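
To make invariants 4, 5, 7, and 8 and the `is_stale` error rule concrete, here is a minimal standalone sketch. It is not the `llm_connect.quality` implementation: the helper names (`to_utc`, `is_stale_sketch`, `read_valid`, `prune_before_sketch`, `demo`) are hypothetical, it works on plain dicts rather than `QualityObservation`, it assumes `recorded_at` is serialised as an ISO 8601 string, and it omits the locking described in invariant 6.

```python
from __future__ import annotations

import json
from datetime import datetime, timedelta, timezone
from pathlib import Path
from tempfile import TemporaryDirectory


def to_utc(value: datetime) -> datetime:
    """Invariant 4: naive datetimes are read as UTC, aware ones are converted."""
    if value.tzinfo is None:
        return value.replace(tzinfo=timezone.utc)
    return value.astimezone(timezone.utc)


def is_stale_sketch(recorded_at: datetime, max_age: timedelta, *, now: datetime | None = None) -> bool:
    """Mirror of the error contract: a negative max_age raises ValueError."""
    if max_age < timedelta(0):
        raise ValueError("max_age must be non-negative")
    current = to_utc(now) if now is not None else datetime.now(timezone.utc)
    return current - to_utc(recorded_at) > max_age


def parse_recorded_at(line: str) -> datetime:
    """Parse one JSON line and return its UTC timestamp (assumed ISO 8601)."""
    return to_utc(datetime.fromisoformat(json.loads(line)["recorded_at"]))


def read_valid(path: Path) -> tuple[list[dict], int]:
    """Invariant 7: return parsed records plus the number of skipped lines."""
    records: list[dict] = []
    malformed = 0
    for line in path.read_text(encoding="utf-8").splitlines():
        try:
            parse_recorded_at(line)
        except (ValueError, KeyError, TypeError):
            malformed += 1  # skip the bad line, keep reading the rest
            continue
        records.append(json.loads(line))
    return records, malformed


def prune_before_sketch(path: Path, cutoff: datetime) -> int:
    """Invariant 8: drop valid records older than cutoff, keep malformed lines."""
    kept: list[str] = []
    removed = 0
    cutoff = to_utc(cutoff)
    for line in path.read_text(encoding="utf-8").splitlines():
        try:
            recorded_at = parse_recorded_at(line)
        except (ValueError, KeyError, TypeError):
            kept.append(line)  # malformed: preserved untouched
            continue
        if recorded_at < cutoff:
            removed += 1
        else:
            kept.append(line)
    path.write_text("".join(f"{item}\n" for item in kept), encoding="utf-8")
    return removed


def demo() -> tuple[int, int, int, int]:
    """Write two valid records and one malformed line, then read and prune."""
    now = datetime(2024, 1, 10, tzinfo=timezone.utc)
    # Naive timestamp: interpreted as UTC under invariant 4.
    old = {"task_type": "summarise", "quality_score": 0.4,
           "recorded_at": datetime(2024, 1, 1).isoformat()}
    new = {"task_type": "summarise", "quality_score": 0.9,
           "recorded_at": datetime(2024, 1, 9, tzinfo=timezone.utc).isoformat()}
    with TemporaryDirectory() as tmp:
        path = Path(tmp) / "ledger.jsonl"
        lines = [json.dumps(old), "{not json", json.dumps(new)]
        path.write_text("".join(f"{item}\n" for item in lines), encoding="utf-8")
        records, malformed = read_valid(path)
        removed = prune_before_sketch(path, now - timedelta(days=3))
        remaining = len(path.read_text(encoding="utf-8").splitlines())
    return len(records), malformed, removed, remaining
```

In this sketch the malformed line is counted on read and survives pruning, so the ledger never silently loses data it could not parse. A real consumer would call `QualityLedger.read_all()`, `malformed_count()`, and `prune_before()` instead of these helpers.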