# Contract: QualityObservation and QualityLedger

**layer:** Functional
**maturity:** Beta
**module:** `llm_connect.quality`
**since:** WP-0004

## Purpose

Record observed quality, cost, latency, and token outcomes for a logical task
type so consumers can build adaptive routing policy without putting
consumer-specific thresholds into llm-connect.

## Public surface

```python
@dataclass(frozen=True)
class QualityObservation:
    task_type: str
    adapter_id: str
    model_id: str
    cost_usd: float
    quality_score: float
    latency_ms: float
    tokens_in: int
    tokens_out: int
    baseline_adapter_id: str | None = None
    recorded_at: datetime = field(default_factory=...)
    tags: dict[str, Any] = field(default_factory=dict)

    @property
    def total_tokens(self) -> int: ...
    def to_dict(self) -> dict[str, Any]: ...
    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "QualityObservation": ...

class QualityLedger:
    def __init__(self, path: str | Path): ...
    @property
    def path(self) -> Path: ...
    def append(self, observation: QualityObservation) -> None: ...
    def read_all(self) -> list[QualityObservation]: ...
    def malformed_count(self) -> int: ...
    def by_task_type(self, task_type: str) -> list[QualityObservation]: ...
    def recent(...) -> list[QualityObservation]: ...
    def mean_quality(...) -> float | None: ...
    def prune_before(self, timestamp: datetime) -> int: ...

def is_stale(observation: QualityObservation, max_age: timedelta, *, now: datetime | None = None) -> bool: ...
```

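The dataclass surface above can be illustrated with a small stand-in. This is a minimal sketch, not the code in `llm_connect.quality`: `ObservationSketch` and `_utc_now` are hypothetical names, the field set is trimmed, and two behaviours are assumptions rather than facts from this contract, namely that `total_tokens` is `tokens_in + tokens_out` and that `to_dict()` encodes `recorded_at` as an ISO 8601 string. The validation follows the field constraints this contract lists under Invariants.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any


def _utc_now() -> datetime:
    # Hypothetical default factory; the real one is elided in the contract.
    return datetime.now(timezone.utc)


@dataclass(frozen=True)
class ObservationSketch:
    """Hypothetical, trimmed stand-in for QualityObservation."""

    task_type: str
    adapter_id: str
    quality_score: float
    tokens_in: int
    tokens_out: int
    recorded_at: datetime = field(default_factory=_utc_now)

    def __post_init__(self) -> None:
        if not self.task_type or not self.adapter_id:
            raise ValueError("task_type and adapter_id must be non-empty")
        if not 0.0 <= self.quality_score <= 1.0:
            raise ValueError("quality_score must be within 0.0..1.0")
        if self.tokens_in < 0 or self.tokens_out < 0:
            raise ValueError("token counts must be non-negative")
        ts = self.recorded_at
        # Naive datetimes are interpreted as UTC; aware ones are converted.
        if ts.tzinfo is None:
            ts = ts.replace(tzinfo=timezone.utc)
        else:
            ts = ts.astimezone(timezone.utc)
        # frozen=True blocks normal assignment, so bypass it once here.
        object.__setattr__(self, "recorded_at", ts)

    @property
    def total_tokens(self) -> int:
        # Assumption: total is simply input plus output tokens.
        return self.tokens_in + self.tokens_out

    def to_dict(self) -> dict[str, Any]:
        data = asdict(self)
        # Assumption: timestamps are stored as ISO 8601 strings.
        data["recorded_at"] = self.recorded_at.isoformat()
        return data

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "ObservationSketch":
        fields = dict(data)
        fields["recorded_at"] = datetime.fromisoformat(fields["recorded_at"])
        return cls(**fields)
```

Because the stand-in is frozen and normalises its timestamp on construction, `ObservationSketch.from_dict(obs.to_dict())` compares equal to `obs`, which is the property a JSON Lines ledger relies on.
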
## Invariants

1. `quality_score` is a normalised `0.0..1.0` score where `1.0` means the
   candidate fully meets the grader's quality bar and `0.0` means complete
   failure for that grader.
2. `task_type`, `adapter_id`, and `model_id` must be non-empty strings.
3. `cost_usd`, `latency_ms`, `tokens_in`, and `tokens_out` are non-negative.
4. `recorded_at` is normalised to UTC. Naive datetimes are interpreted as UTC.
5. Ledger records are JSON Lines. Each line is the JSON encoding of one
   `QualityObservation.to_dict()` result.
6. `QualityLedger.append()` holds a process-local lock plus an advisory file
   lock around each write.
7. Read/query helpers skip malformed lines instead of failing the whole ledger.
   `malformed_count()` exposes how many lines were skipped.
8. `prune_before()` removes only valid observations older than the cutoff.
   Malformed lines are preserved.

## Error contract

| Condition | Exception |
|-----------|-----------|
| Invalid observation field | `ValueError` |
| Invalid datetime field | `TypeError` or `ValueError` |
| Negative recent limit | `ValueError` |
| `mean_quality(min_observations <= 0)` | `ValueError` |
| `is_stale(max_age < 0)` | `ValueError` |

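The argument checks in the table can be sketched with stand-in functions. These are illustrations under stated assumptions, not the library's code: the `_sketch` functions are hypothetical, they take bare timestamps and score lists instead of observations, and the parameter name `limit`, the "strictly older than `max_age`" comparison, and returning `None` when fewer than `min_observations` scores exist are all guesses, since the contract elides the real `recent(...)` and `mean_quality(...)` signatures.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone


def is_stale_sketch(recorded_at: datetime, max_age: timedelta, *, now: datetime | None = None) -> bool:
    """Return True when the timestamp is older than max_age."""
    if max_age < timedelta(0):
        raise ValueError("max_age must be non-negative")
    current = now if now is not None else datetime.now(timezone.utc)
    # Assumption: an observation exactly max_age old is not yet stale.
    return current - recorded_at > max_age


def recent_sketch(scores: list[float], limit: int) -> list[float]:
    """Return the last `limit` scores, oldest first."""
    if limit < 0:
        raise ValueError("limit must be non-negative")
    # scores[-0:] would return everything, so handle zero explicitly.
    return scores[-limit:] if limit else []


def mean_quality_sketch(scores: list[float], min_observations: int = 1) -> float | None:
    """Return the mean score, or None when there is too little data."""
    if min_observations <= 0:
        raise ValueError("min_observations must be positive")
    if len(scores) < min_observations:
        return None
    return sum(scores) / len(scores)
```

Raising `ValueError` before touching any data keeps a bad argument from being mistaken for an empty or stale ledger.
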
## Known consumers

- `infospace-bench` is the first intended consumer. It is expected to provide
  task taxonomy, thresholds, and baseline choice.

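To show where consumer-owned thresholds would live, here is a minimal sketch of a routing policy a consumer might build on top of ledger data. Nothing here comes from `infospace-bench` or llm-connect: `choose_adapter`, its parameters, and the "cheapest adapter that clears the quality bar, else the baseline" rule are hypothetical, and observations are reduced to `(adapter_id, quality_score, cost_usd)` tuples.

```python
from __future__ import annotations

from statistics import fmean


def choose_adapter(
    observations: list[tuple[str, float, float]],
    *,
    baseline: str,
    min_quality: float,
    min_observations: int = 3,
) -> str:
    """Pick the cheapest adapter whose mean quality clears the consumer's bar."""
    by_adapter: dict[str, list[tuple[float, float]]] = {}
    for adapter_id, quality, cost in observations:
        by_adapter.setdefault(adapter_id, []).append((quality, cost))

    candidates: list[tuple[float, str]] = []
    for adapter_id, rows in by_adapter.items():
        if len(rows) < min_observations:
            continue  # too little evidence to trust this adapter
        mean_quality = fmean([quality for quality, _ in rows])
        if mean_quality >= min_quality:
            mean_cost = fmean([cost for _, cost in rows])
            candidates.append((mean_cost, adapter_id))

    # Fall back to the consumer's baseline when nothing qualifies.
    return min(candidates)[1] if candidates else baseline
```

The threshold, the minimum sample size, and the baseline are all arguments supplied by the caller, which keeps those choices on the consumer side as this contract intends.
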
## Notes

The ledger intentionally stores only observation metadata in this slice. Callers
that need prompt or response digests can place those in `tags`, for example
`prompt_fingerprint`.
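
One way a caller might produce such a digest is sketched below. The helper name `prompt_fingerprint` and the choice of SHA-256 over the UTF-8 prompt text are assumptions for illustration; the contract only names the tag key.

```python
import hashlib


def prompt_fingerprint(prompt: str) -> str:
    """Return a stable hex digest so the prompt text itself is never stored."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


# The resulting dict could be passed as the `tags` field of an observation.
tags = {"prompt_fingerprint": prompt_fingerprint("Summarise the attached report.")}
```

The digest lets a consumer group observations by prompt without the ledger ever holding prompt content.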