llm-connect/contracts/functional/quality-ledger.md

# Contract: QualityObservation and QualityLedger

**layer:** Functional
**maturity:** Beta
**module:** `llm_connect.quality`
**since:** WP-0004

## Purpose

Record observed quality, cost, latency, and token outcomes for a logical task
type so consumers can build adaptive routing policy without putting
consumer-specific thresholds into llm-connect.
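
The split of responsibilities can be illustrated with a consumer-side policy. The sketch below is not part of llm-connect: `pick_adapter`, its parameters, and the sample data are hypothetical, and it assumes observations are available as plain dicts keyed by the `QualityObservation` field names. The threshold lives entirely in the consumer.

```python
from collections import defaultdict
from statistics import mean


def pick_adapter(observations, task_type, min_quality, min_observations=3):
    """Hypothetical consumer policy: cheapest adapter whose mean quality clears the bar."""
    by_adapter = defaultdict(list)
    for obs in observations:
        if obs["task_type"] == task_type:
            by_adapter[obs["adapter_id"]].append(obs)

    candidates = []
    for adapter_id, rows in by_adapter.items():
        if len(rows) < min_observations:
            continue  # too little evidence to trust the mean
        quality = mean(r["quality_score"] for r in rows)
        cost = mean(r["cost_usd"] for r in rows)
        if quality >= min_quality:
            candidates.append((cost, adapter_id))

    # Tuples sort by cost first, so min() yields the cheapest qualifying adapter.
    return min(candidates)[1] if candidates else None


# Illustrative data only; adapter names and numbers are made up.
observations = [
    {"task_type": "summarise", "adapter_id": "small", "quality_score": 0.82, "cost_usd": 0.001},
    {"task_type": "summarise", "adapter_id": "small", "quality_score": 0.86, "cost_usd": 0.001},
    {"task_type": "summarise", "adapter_id": "large", "quality_score": 0.95, "cost_usd": 0.01},
    {"task_type": "summarise", "adapter_id": "large", "quality_score": 0.97, "cost_usd": 0.01},
]

relaxed = pick_adapter(observations, "summarise", min_quality=0.8, min_observations=2)
strict = pick_adapter(observations, "summarise", min_quality=0.9, min_observations=2)
impossible = pick_adapter(observations, "summarise", min_quality=0.99, min_observations=2)
```

Raising `min_quality` moves the choice from the cheaper adapter to the more expensive one, and an unreachable bar returns `None` so the consumer can fall back to its own baseline.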

## Public surface

```python
@dataclass(frozen=True)
class QualityObservation:
    task_type: str
    adapter_id: str
    model_id: str
    cost_usd: float
    quality_score: float
    latency_ms: float
    tokens_in: int
    tokens_out: int
    baseline_adapter_id: str | None = None
    recorded_at: datetime = field(default_factory=...)
    tags: dict[str, Any] = field(default_factory=dict)

    @property
    def total_tokens(self) -> int: ...
    def to_dict(self) -> dict[str, Any]: ...
    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "QualityObservation": ...

class QualityLedger:
    def __init__(self, path: str | Path): ...
    @property
    def path(self) -> Path: ...
    def append(self, observation: QualityObservation) -> None: ...
    def read_all(self) -> list[QualityObservation]: ...
    def malformed_count(self) -> int: ...
    def by_task_type(self, task_type: str) -> list[QualityObservation]: ...
    def recent(...) -> list[QualityObservation]: ...
    def mean_quality(...) -> float | None: ...
    def prune_before(self, timestamp: datetime) -> int: ...

def is_stale(observation: QualityObservation, max_age: timedelta, *, now: datetime | None = None) -> bool: ...
```

## Invariants

1. `quality_score` is a normalised `0.0..1.0` score where `1.0` means the
   candidate fully meets the grader's quality bar and `0.0` means complete
   failure for that grader.
2. `task_type`, `adapter_id`, and `model_id` must be non-empty strings.
3. `cost_usd`, `latency_ms`, `tokens_in`, and `tokens_out` are non-negative.
4. `recorded_at` is normalised to UTC. Naive datetimes are interpreted as UTC.
5. Ledger records are JSON Lines. Each line is the JSON serialisation of one
   `QualityObservation.to_dict()` result.
6. `QualityLedger.append()` holds a process-local lock and an advisory file
   lock for the duration of each write.
7. Read/query helpers skip malformed lines instead of failing the whole ledger.
   `malformed_count()` exposes how many lines were skipped.
8. `prune_before()` removes only valid observations older than the cutoff.
   Malformed lines are preserved.
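
Invariants 4, 5, 7, and 8 can be sketched with standalone functions over a JSON Lines file. This is a minimal illustration, not the `QualityLedger` implementation: the function names are hypothetical, records are reduced to plain dicts, and "malformed" is assumed to mean "not a JSON object with a parseable `recorded_at`". The advisory file lock from invariant 6 is omitted because `fcntl` is POSIX-only; only the process-local lock is shown.

```python
import json
import tempfile
import threading
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

_LOCK = threading.Lock()  # process-local lock only; no advisory file lock in this sketch


def append_record(path: Path, record: dict) -> None:
    """Append one record as a single JSON line."""
    line = json.dumps(record, sort_keys=True)
    with _LOCK, path.open("a", encoding="utf-8") as fh:
        fh.write(line + "\n")


def _parse(line: str) -> Optional[dict]:
    """Return the record, or None when the line is malformed."""
    try:
        record = json.loads(line)
        datetime.fromisoformat(record["recorded_at"])
    except (ValueError, KeyError, TypeError):
        return None
    return record


def _recorded_at(record: dict) -> datetime:
    """Parse the timestamp, treating naive values as UTC."""
    ts = datetime.fromisoformat(record["recorded_at"])
    if ts.tzinfo is None:
        return ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(timezone.utc)


def read_records(path: Path):
    """Return (valid records, number of malformed lines skipped)."""
    records, malformed = [], 0
    for line in path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        record = _parse(line)
        if record is None:
            malformed += 1
        else:
            records.append(record)
    return records, malformed


def prune_before(path: Path, cutoff: datetime) -> int:
    """Drop valid records older than cutoff; keep malformed lines verbatim."""
    kept, removed = [], 0
    with _LOCK:
        for line in path.read_text(encoding="utf-8").splitlines():
            record = _parse(line)
            if record is not None and _recorded_at(record) < cutoff:
                removed += 1
            else:
                kept.append(line)
        path.write_text("".join(l + "\n" for l in kept), encoding="utf-8")
    return removed


with tempfile.TemporaryDirectory() as tmp:
    ledger = Path(tmp) / "ledger.jsonl"
    append_record(ledger, {"task_type": "t", "recorded_at": "2024-01-01T00:00:00+00:00"})
    with ledger.open("a", encoding="utf-8") as fh:
        fh.write("{not json\n")  # simulate a corrupted line
    append_record(ledger, {"task_type": "t", "recorded_at": "2024-06-01T00:00:00"})
    before = read_records(ledger)
    removed = prune_before(ledger, datetime(2024, 3, 1, tzinfo=timezone.utc))
    after = read_records(ledger)
    leftover = ledger.read_text(encoding="utf-8")
```

In the demo, the corrupted line is counted by the reader but never deleted by pruning, which matches the intent of invariants 7 and 8: a damaged line stays available for inspection.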

## Error contract

| Condition | Exception |
|-----------|-----------|
| Invalid observation field | `ValueError` |
| Invalid datetime field | `TypeError` or `ValueError` |
| Negative limit passed to `recent()` | `ValueError` |
| `mean_quality(min_observations <= 0)` | `ValueError` |
| `is_stale(max_age < 0)` | `ValueError` |
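
The table can be read against a reduced stand-in. The sketch below is an assumption-laden illustration, not the `llm_connect.quality` code: `Observation`, `_to_utc`, and `is_stale_sketch` are hypothetical names, only three fields are validated, and treating an out-of-range `quality_score` as a `ValueError` is inferred from invariant 1 rather than stated by the contract. The strict `>` comparison for staleness is likewise an assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional


def _to_utc(value: datetime) -> datetime:
    """Normalise to UTC; naive datetimes are interpreted as UTC."""
    if not isinstance(value, datetime):
        raise TypeError("expected a datetime")
    if value.tzinfo is None:
        return value.replace(tzinfo=timezone.utc)
    return value.astimezone(timezone.utc)


@dataclass(frozen=True)
class Observation:  # hypothetical, reduced stand-in for QualityObservation
    task_type: str
    quality_score: float
    cost_usd: float
    recorded_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def __post_init__(self) -> None:
        if not isinstance(self.task_type, str) or not self.task_type:
            raise ValueError("task_type must be a non-empty string")
        if not 0.0 <= self.quality_score <= 1.0:
            raise ValueError("quality_score must be within 0.0..1.0")
        if self.cost_usd < 0:
            raise ValueError("cost_usd must be non-negative")
        # Frozen dataclass: bypass the generated __setattr__ to store the normalised value.
        object.__setattr__(self, "recorded_at", _to_utc(self.recorded_at))


def is_stale_sketch(
    observation: Observation,
    max_age: timedelta,
    *,
    now: Optional[datetime] = None,
) -> bool:
    """Stand-in for is_stale: True when the observation is older than max_age."""
    if max_age < timedelta(0):
        raise ValueError("max_age must be non-negative")
    current = _to_utc(now) if now is not None else datetime.now(timezone.utc)
    return current - observation.recorded_at > max_age


obs = Observation("summarise", 0.9, 0.002, recorded_at=datetime(2024, 1, 1, 12, 0))
check_time = datetime(2024, 1, 1, 14, 0, tzinfo=timezone.utc)
stale_after_1h = is_stale_sketch(obs, timedelta(hours=1), now=check_time)
stale_after_3h = is_stale_sketch(obs, timedelta(hours=3), now=check_time)
```

Passing `now` explicitly keeps the staleness check deterministic in tests, which is presumably why the real signature exposes it as a keyword-only argument.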

## Known consumers

- `infospace-bench` is the first intended consumer. It is expected to provide
  task taxonomy, thresholds, and baseline choice.

## Notes

The ledger intentionally stores only observation metadata in this slice, not
prompt or response content. Callers that need prompt or response digests can
place them in `tags`, for example under a `prompt_fingerprint` key.
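
One way a caller might derive such a digest is shown below. The contract does not prescribe a hash or a key format; SHA-256 over the UTF-8 prompt and the helper name `prompt_fingerprint` are assumptions made for illustration.

```python
import hashlib


def prompt_fingerprint(prompt: str) -> str:
    """Hypothetical helper: stable hex digest of the prompt text."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


# The digest, not the prompt itself, goes into the observation's tags.
tags = {"prompt_fingerprint": prompt_fingerprint("Summarise the following article: ...")}
```

Storing only the digest lets consumers group observations by prompt without the ledger ever holding prompt text.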