llm-connect/contracts/functional/quality-ledger.md
tegwick c4ad4bb9f2
Add adaptive cost-quality routing primitives
2026-05-17 21:32:27 +02:00


# Contract: QualityObservation and QualityLedger
**layer:** Functional
**maturity:** Beta
**module:** `llm_connect.quality`
**since:** WP-0004
## Purpose
Record observed quality, cost, latency, and token outcomes for a logical task
type so consumers can build adaptive routing policy without putting
consumer-specific thresholds into llm-connect.
## Public surface
```python
@dataclass(frozen=True)
class QualityObservation:
    task_type: str
    adapter_id: str
    model_id: str
    cost_usd: float
    quality_score: float
    latency_ms: float
    tokens_in: int
    tokens_out: int
    baseline_adapter_id: str | None = None
    recorded_at: datetime = field(default_factory=...)
    tags: dict[str, Any] = field(default_factory=dict)

    @property
    def total_tokens(self) -> int: ...
    def to_dict(self) -> dict[str, Any]: ...
    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "QualityObservation": ...


class QualityLedger:
    def __init__(self, path: str | Path): ...
    @property
    def path(self) -> Path: ...
    def append(self, observation: QualityObservation) -> None: ...
    def read_all(self) -> list[QualityObservation]: ...
    def malformed_count(self) -> int: ...
    def by_task_type(self, task_type: str) -> list[QualityObservation]: ...
    def recent(...) -> list[QualityObservation]: ...
    def mean_quality(...) -> float | None: ...
    def prune_before(self, timestamp: datetime) -> int: ...


def is_stale(observation: QualityObservation, max_age: timedelta, *, now: datetime | None = None) -> bool: ...
```
## Invariants
1. `quality_score` is a normalised `0.0..1.0` score where `1.0` means the
candidate fully meets the grader's quality bar and `0.0` means complete
failure for that grader.
2. `task_type`, `adapter_id`, and `model_id` must be non-empty strings.
3. `cost_usd`, `latency_ms`, `tokens_in`, and `tokens_out` are non-negative.
4. `recorded_at` is normalised to UTC. Naive datetimes are interpreted as UTC.
5. Ledger records are JSON Lines. Each line is one `QualityObservation.to_dict()`.
6. `QualityLedger.append()` acquires a process-local lock and an advisory file
lock around each write.
7. Read/query helpers skip malformed lines instead of failing the whole ledger.
`malformed_count()` exposes how many lines were skipped.
8. `prune_before()` removes only valid observations older than the cutoff.
Malformed lines are preserved.
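
Invariants 4, 7, and 8 can be illustrated with a small standalone sketch. It does not use `llm_connect` and is not the module's implementation: the helper names (`normalise_utc`, `parse_line`, `read_tolerant`, `prune_lines`) are hypothetical, records are plain dicts rather than `QualityObservation` instances, and `recorded_at` is assumed to be serialised as an ISO 8601 string.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def normalise_utc(value: datetime) -> datetime:
    """Invariant 4: naive datetimes are read as UTC, aware ones are converted."""
    if value.tzinfo is None:
        return value.replace(tzinfo=timezone.utc)
    return value.astimezone(timezone.utc)


def parse_line(line: str) -> Optional[dict]:
    """Return the record for one JSONL line, or None if the line is malformed."""
    try:
        record = json.loads(line)
        if not isinstance(record, dict):
            return None
        # Assumption: the timestamp is stored as an ISO 8601 string.
        parsed = datetime.fromisoformat(record["recorded_at"])
        record["recorded_at"] = normalise_utc(parsed)
        return record
    except (ValueError, KeyError, TypeError):
        return None


def read_tolerant(lines: list[str]) -> tuple[list[dict], int]:
    """Invariant 7: skip malformed lines and report how many were skipped."""
    records, malformed = [], 0
    for line in lines:
        if not line.strip():
            continue  # blank lines are ignored, not counted as malformed
        record = parse_line(line)
        if record is None:
            malformed += 1
        else:
            records.append(record)
    return records, malformed


def prune_lines(lines: list[str], cutoff: datetime) -> tuple[list[str], int]:
    """Invariant 8: drop valid records older than the cutoff, keep everything else."""
    cutoff = normalise_utc(cutoff)
    kept, removed = [], 0
    for line in lines:
        record = parse_line(line)
        if record is not None and record["recorded_at"] < cutoff:
            removed += 1
        else:
            kept.append(line)  # malformed lines are preserved verbatim
    return kept, removed


sample = [
    '{"task_type": "summarise", "quality_score": 0.9, "recorded_at": "2026-01-01T00:00:00"}',
    "not json at all",
    '{"task_type": "summarise", "quality_score": 0.7, "recorded_at": "2026-03-01T12:00:00+02:00"}',
]
records, malformed = read_tolerant(sample)
kept, removed = prune_lines(sample, datetime(2026, 2, 1))
```

In this sketch the malformed middle line is counted once by `read_tolerant` and survives `prune_lines`, while the January record is the only one removed.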
## Error contract
| Condition | Exception |
|-----------|-----------|
| Invalid observation field | `ValueError` |
| Invalid datetime field | `TypeError` or `ValueError` |
| Negative limit passed to `recent()` | `ValueError` |
| `mean_quality(min_observations <= 0)` | `ValueError` |
| `is_stale(max_age < 0)` | `ValueError` |
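
The last two rows of the table can be sketched as follows. This is an illustration under stated assumptions rather than the module's code: `is_stale_sketch` takes a timestamp instead of a full `QualityObservation`, `mean_quality_sketch` takes a plain list of scores, a record exactly `max_age` old is treated as not stale, and `None` is assumed to mean "fewer than `min_observations` scores available".

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Sequence


def is_stale_sketch(
    recorded_at: datetime,
    max_age: timedelta,
    *,
    now: Optional[datetime] = None,
) -> bool:
    """Hypothetical stand-in for is_stale(); rejects a negative max_age."""
    if max_age < timedelta(0):
        raise ValueError("max_age must be non-negative")
    current = now if now is not None else datetime.now(timezone.utc)
    # Assumption: staleness is a strict comparison against max_age.
    return current - recorded_at > max_age


def mean_quality_sketch(
    scores: Sequence[float],
    min_observations: int = 1,
) -> Optional[float]:
    """Hypothetical stand-in for mean_quality(); rejects min_observations <= 0."""
    if min_observations <= 0:
        raise ValueError("min_observations must be positive")
    if len(scores) < min_observations:
        return None  # assumption: too little evidence yields None
    return sum(scores) / len(scores)


now = datetime(2026, 5, 17, tzinfo=timezone.utc)
old = is_stale_sketch(now - timedelta(days=8), timedelta(days=7), now=now)    # True
fresh = is_stale_sketch(now - timedelta(days=1), timedelta(days=7), now=now)  # False
enough = mean_quality_sketch([0.5, 1.0], min_observations=2)                  # 0.75
too_few = mean_quality_sketch([0.5], min_observations=2)                      # None
```

Passing `now` explicitly, as the real signature allows, keeps staleness checks deterministic in tests.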
## Known consumers
- `infospace-bench` is the first intended consumer. It is expected to provide
task taxonomy, thresholds, and baseline choice.
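
To show where the consumer-owned pieces fit, here is one possible routing policy a consumer could build on top of ledger data. It is a sketch only: `Obs`, `choose_adapter`, the threshold values, and the "cheapest adapter that clears the quality bar, else the baseline" rule are all hypothetical and are not part of llm-connect or infospace-bench.

```python
from collections import defaultdict
from typing import NamedTuple


class Obs(NamedTuple):
    """Hypothetical, trimmed-down stand-in for a ledger observation."""
    adapter_id: str
    cost_usd: float
    quality_score: float


def choose_adapter(observations, baseline_adapter_id, min_quality, min_observations=2):
    """Pick the cheapest adapter whose mean quality meets the consumer's threshold."""
    by_adapter = defaultdict(list)
    for obs in observations:
        by_adapter[obs.adapter_id].append(obs)

    candidates = []
    for adapter_id, group in by_adapter.items():
        if len(group) < min_observations:
            continue  # not enough evidence for this adapter yet
        mean_quality = sum(o.quality_score for o in group) / len(group)
        mean_cost = sum(o.cost_usd for o in group) / len(group)
        if mean_quality >= min_quality:
            candidates.append((mean_cost, adapter_id))

    if not candidates:
        return baseline_adapter_id  # the consumer's chosen fallback
    return min(candidates)[1]


history = [
    Obs("small", 0.001, 0.85), Obs("small", 0.001, 0.90),
    Obs("large", 0.02, 0.95), Obs("large", 0.02, 0.97),
    Obs("tiny", 0.0002, 0.40), Obs("tiny", 0.0002, 0.50),
]
relaxed = choose_adapter(history, "large", min_quality=0.8)   # "small"
strict = choose_adapter(history, "large", min_quality=0.99)   # "large" (fallback)
```

The threshold and the baseline are arguments supplied by the caller, which mirrors the split described above: the ledger records outcomes, and the consumer decides what counts as good enough.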
## Notes
The ledger intentionally stores only observation metadata in this slice. Callers
that need prompt or response digests can place those in `tags`, for example
under a `prompt_fingerprint` key.
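
A caller could derive such a digest as sketched below. The helper name, the choice of SHA-256, and the truncation to 16 hex characters are assumptions for illustration; the contract does not prescribe how a fingerprint is computed.

```python
import hashlib


def prompt_fingerprint(prompt: str) -> str:
    """Hypothetical helper: a short, stable digest of the prompt text."""
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return digest[:16]  # assumption: a truncated digest is enough for grouping


# The resulting dict would be passed as the `tags` field of an observation.
tags = {"prompt_fingerprint": prompt_fingerprint("Summarise the attached report.")}
```

Storing a digest rather than the prompt itself keeps the ledger free of prompt content while still letting callers group observations by prompt.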