Add adaptive cost-quality routing primitives

2026-05-17 21:32:27 +02:00
parent bf86a03c5d
commit c4ad4bb9f2
17 changed files with 2480 additions and 25 deletions
--- a/contracts/functional/quality-ledger.md
+++ b/contracts/functional/quality-ledger.md
@@ -0,0 +1,87 @@
+# Contract: QualityObservation and QualityLedger
+
+**layer:** Functional
+**maturity:** Beta
+**module:** `llm_connect.quality`
+**since:** WP-0004
+
+## Purpose
+
+Record observed quality, cost, latency, and token outcomes for a logical task
+type so consumers can build adaptive routing policy without putting
+consumer-specific thresholds into llm-connect.
+
+## Public surface
+
+```python
+@dataclass(frozen=True)
+class QualityObservation:
+    task_type: str
+    adapter_id: str
+    model_id: str
+    cost_usd: float
+    quality_score: float
+    latency_ms: float
+    tokens_in: int
+    tokens_out: int
+    baseline_adapter_id: str | None = None
+    recorded_at: datetime = field(default_factory=...)
+    tags: dict[str, Any] = field(default_factory=dict)
+
+    @property
+    def total_tokens(self) -> int: ...
+    def to_dict(self) -> dict[str, Any]: ...
+    @classmethod
+    def from_dict(cls, data: dict[str, Any]) -> "QualityObservation": ...
+
+class QualityLedger:
+    def __init__(self, path: str | Path): ...
+    @property
+    def path(self) -> Path: ...
+    def append(self, observation: QualityObservation) -> None: ...
+    def read_all(self) -> list[QualityObservation]: ...
+    def malformed_count(self) -> int: ...
+    def by_task_type(self, task_type: str) -> list[QualityObservation]: ...
+    def recent(...) -> list[QualityObservation]: ...
+    def mean_quality(...) -> float | None: ...
+    def prune_before(self, timestamp: datetime) -> int: ...
+
+def is_stale(observation: QualityObservation, max_age: timedelta, *, now: datetime | None = None) -> bool: ...
+```
+
+## Invariants
+
+1. `quality_score` is a normalised `0.0..1.0` score where `1.0` means the
+   candidate fully meets the grader's quality bar and `0.0` means complete
+   failure for that grader.
+2. `task_type`, `adapter_id`, and `model_id` must be non-empty strings.
+3. `cost_usd`, `latency_ms`, `tokens_in`, and `tokens_out` are non-negative.
+4. `recorded_at` is normalised to UTC. Naive datetimes are interpreted as UTC.
+5. Ledger records are JSON Lines. Each line is one `QualityObservation.to_dict()`.
+6. `QualityLedger.append()` acquires a process-local lock and an advisory file
+   lock around each write.
+7. Read/query helpers skip malformed lines instead of failing the whole ledger.
+   `malformed_count()` exposes how many lines were skipped.
+8. `prune_before()` removes only valid observations older than the cutoff.
+   Malformed lines are preserved.
+
+## Error contract
+
+| Condition | Exception |
+|-----------|-----------|
+| Invalid observation field | `ValueError` |
+| Invalid datetime field | `TypeError` or `ValueError` |
+| Negative limit passed to `recent()` | `ValueError` |
+| `mean_quality(min_observations <= 0)` | `ValueError` |
+| `is_stale(max_age < 0)` | `ValueError` |
+
+## Known consumers
+
+- `infospace-bench` is the first intended consumer. It is expected to provide
+  task taxonomy, thresholds, and baseline choice.
+
+## Notes
+
+The ledger intentionally stores only observation metadata in this slice. Callers
+that need prompt or response digests can place them in `tags`, for example
+under a `prompt_fingerprint` key.
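
Review note on invariants 4, 5, 7, and 8: the sketch below is one way the JSON Lines behaviour described in the contract could look. It is not the `llm_connect.quality` implementation. The helper names `parse_line`, `read_valid`, and `prune_before` (as a free function) are hypothetical, the validity check is reduced to "parses as JSON and has an ISO 8601 `recorded_at`", and the locking from invariant 6 is left out entirely.

```python
import json
import tempfile
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Optional


def parse_line(line: str) -> Optional[dict]:
    """Return the decoded record, or None if the line is malformed (hypothetical check)."""
    try:
        record = json.loads(line)
        stamp = datetime.fromisoformat(record["recorded_at"])
    except (ValueError, KeyError, TypeError):
        return None
    # Invariant 4: naive datetimes are interpreted as UTC.
    if stamp.tzinfo is None:
        stamp = stamp.replace(tzinfo=timezone.utc)
    record["recorded_at"] = stamp.astimezone(timezone.utc)
    return record


def read_valid(path: Path) -> tuple[list[dict], int]:
    """Invariant 7: skip malformed lines and report how many were skipped."""
    records, malformed = [], 0
    for line in path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        record = parse_line(line)
        if record is None:
            malformed += 1
        else:
            records.append(record)
    return records, malformed


def prune_before(path: Path, cutoff: datetime) -> int:
    """Invariant 8: drop only valid records older than cutoff; keep malformed lines."""
    kept, removed = [], 0
    for line in path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        record = parse_line(line)
        if record is not None and record["recorded_at"] < cutoff:
            removed += 1
        else:
            kept.append(line)  # preserve the original text, malformed or not
    path.write_text("".join(text + "\n" for text in kept), encoding="utf-8")
    return removed


# Demo data: one old record, one malformed line, one fresh record.
now = datetime(2026, 5, 17, tzinfo=timezone.utc)
old = {"task_type": "summarise", "quality_score": 0.4,
       "recorded_at": (now - timedelta(days=30)).isoformat()}
new = {"task_type": "summarise", "quality_score": 0.9,
       "recorded_at": now.isoformat()}

with tempfile.TemporaryDirectory() as tmp:
    ledger = Path(tmp) / "quality.jsonl"
    ledger.write_text(
        json.dumps(old) + "\n" + "{not json\n" + json.dumps(new) + "\n",
        encoding="utf-8",
    )
    records, malformed = read_valid(ledger)
    removed = prune_before(ledger, now - timedelta(days=7))
    remaining_lines = ledger.read_text(encoding="utf-8").splitlines()
```

In this sketch the malformed line survives pruning byte for byte, so a later repair pass could still inspect it. Whether the real ledger rewrites the file in place or through a temporary file is not stated in the contract.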