# Contract: ShadowingAdapter

**layer:** Functional
**maturity:** Beta
**module:** `llm_connect.shadowing`
**since:** WP-0004

## Purpose

Collect quality observations without changing caller-visible model behavior.
`ShadowingAdapter` wraps a candidate adapter and returns the candidate response
to the caller. On a sampled subset of calls it also runs baseline and grading
work, which appends `QualityObservation` records to a `QualityLedger`.

## Public surface

```python
@dataclass
class ShadowingAdapter(LLMAdapter):
    candidate_adapter: LLMAdapter
    baseline_adapter: LLMAdapter
    grader: BaselineGrader
    ledger: QualityLedger
    task_type: str
    adapter_id: str
    model_id: Optional[str] = None
    baseline_adapter_id: Optional[str] = None
    shadow_rate: float = 1.0
    async_shadow: bool = False
    tags: Mapping[str, Any] = field(default_factory=dict)
    on_shadow_error: Optional[Callable[[Exception], None]] = None

    def execute_prompt(self, prompt: str, config: RunConfig) -> LLMResponse: ...
    async def async_execute_prompt(self, prompt: str, config: RunConfig) -> LLMResponse: ...
    def flush(self, timeout: Optional[float] = None) -> None: ...
    def shutdown(self, wait: bool = True) -> None: ...
```

## Invariants

1. The candidate adapter is always called first.
2. The response returned by `execute_prompt()` and `async_execute_prompt()` is
   always the candidate response.
3. Shadow failures from the baseline adapter, grader, or ledger writer are
   isolated from the caller. They are sent to `on_shadow_error` when configured.
4. `shadow_rate=0.0` records no observations. `shadow_rate=1.0` shadows every
   successful candidate call. Intermediate values sample with `random_source`.
5. Shadow grading reuses the candidate response already returned by the wrapped
   candidate adapter; it does not make a second candidate model call.
6. Shadow calls use a copy of `RunConfig` with `budget_tracker=None`, so
   observation collection cannot consume the caller's foreground token budget.
7. `async_shadow=True` schedules shadow work on a background thread. `flush()`
   waits for currently queued work, and `shutdown()` releases the executor.

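
The synchronous invariants above (1 through 6) can be illustrated with a minimal, self-contained sketch. This is not the `llm_connect.shadowing` implementation: `SketchConfig`, `SketchShadower`, the callable-based adapters, and the list used as a ledger are all hypothetical stand-ins, and grading is reduced to a function that returns a float.

```python
import random
from dataclasses import dataclass, field, replace
from typing import Any, Callable, List, Optional


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for RunConfig."""
    model_name: str = "demo-model"
    budget_tracker: Optional[Any] = None


@dataclass
class SketchShadower:
    """Hypothetical stand-in that mirrors the invariants, not the real class."""
    candidate: Callable[[str, SketchConfig], str]
    baseline: Callable[[str, SketchConfig], str]
    grade: Callable[[str, str], float]
    ledger: List[float] = field(default_factory=list)
    shadow_rate: float = 1.0
    on_shadow_error: Optional[Callable[[Exception], None]] = None
    rng: random.Random = field(default_factory=random.Random)

    def execute_prompt(self, prompt: str, config: SketchConfig) -> str:
        # Invariant 1: the candidate runs first; its exceptions propagate.
        response = self.candidate(prompt, config)
        # Invariant 4: rng.random() is in [0.0, 1.0), so a rate of 0.0 never
        # shadows and a rate of 1.0 always does.
        if self.rng.random() < self.shadow_rate:
            self._shadow(prompt, response, config)
        # Invariant 2: the caller always receives the candidate response.
        return response

    def _shadow(self, prompt: str, response: str, config: SketchConfig) -> None:
        try:
            # Invariant 6: shadow work gets a copy without the budget tracker.
            shadow_config = replace(config, budget_tracker=None)
            reference = self.baseline(prompt, shadow_config)
            # Invariant 5: grade the response already in hand.
            self.ledger.append(self.grade(response, reference))
        except Exception as exc:
            # Invariant 3: shadow failures never reach the caller.
            if self.on_shadow_error is not None:
                self.on_shadow_error(exc)


def failing_baseline(prompt: str, config: SketchConfig) -> str:
    raise RuntimeError("baseline unavailable")


errors: List[Exception] = []
shadower = SketchShadower(
    candidate=lambda p, c: "candidate:" + p,
    baseline=failing_baseline,
    grade=lambda a, b: 1.0,
    on_shadow_error=errors.append,
)
result = shadower.execute_prompt("hi", SketchConfig(budget_tracker=object()))
```

In this sketch the failing baseline leaves `result` untouched and routes the exception to the callback, which is the behavior invariants 2 and 3 describe. Background scheduling (invariant 7) is left out to keep the example short.
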
## Observation mapping

The appended observation uses:

- `task_type` from the wrapper configuration
- `adapter_id` from the wrapper configuration
- `model_id` from the wrapper configuration, then candidate response model, then
  `RunConfig.model_name`
- `quality_score` from the `GradingResult`
- `cost_usd` from response metadata keys `cost_usd`, `estimated_cost_usd`, or
  `cost`, falling back to `0.0`
- token counts from candidate response usage keys `prompt_tokens` and
  `completion_tokens`
- `baseline_adapter_id` and `tags` from wrapper configuration

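
The two fallback chains in the list above (for `model_id` and `cost_usd`) can be sketched as small helpers. The function names `resolve_model_id` and `resolve_cost_usd` are hypothetical, and two details are assumptions rather than documented behavior: empty strings are treated as missing for `model_id`, and a metadata key whose value is `None` is treated as absent.

```python
from typing import Any, Mapping, Optional


def resolve_model_id(
    configured: Optional[str],
    response_model: Optional[str],
    config_model_name: Optional[str],
) -> Optional[str]:
    """Return the first non-empty value: wrapper config, response, RunConfig."""
    for value in (configured, response_model, config_model_name):
        if value:  # assumption: empty strings count as missing
            return value
    return None


def resolve_cost_usd(metadata: Mapping[str, Any]) -> float:
    """Return the first cost key present in metadata, else 0.0."""
    for key in ("cost_usd", "estimated_cost_usd", "cost"):
        if metadata.get(key) is not None:  # assumption: None counts as absent
            return float(metadata[key])
    return 0.0
```

Checking the keys in a fixed order means a response that reports both `estimated_cost_usd` and `cost` resolves to the former.
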
## Error contract

| Condition | Exception |
|-----------|-----------|
| Empty `task_type` | `ValueError` |
| Empty `adapter_id` | `ValueError` |
| `shadow_rate` outside `0..1` | `ValueError` |
| Candidate adapter failure | Original exception propagates |
| Shadow baseline/grading/ledger failure | Suppressed; optional callback |

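
The three `ValueError` rows could be enforced with dataclass validation along the following lines. This is a sketch under assumptions: `ShadowSettings` is a hypothetical subset of the wrapper's fields, the checks are assumed to run at construction time, and the error messages are illustrative. The contract above does not state when the real checks run or how whitespace-only strings are treated.

```python
from dataclasses import dataclass


@dataclass
class ShadowSettings:
    """Hypothetical subset of the wrapper's fields, for illustration only."""
    task_type: str
    adapter_id: str
    shadow_rate: float = 1.0

    def __post_init__(self) -> None:
        # Assumption: validation happens once, when the object is built.
        if not self.task_type:
            raise ValueError("task_type must be non-empty")
        if not self.adapter_id:
            raise ValueError("adapter_id must be non-empty")
        # Both endpoints are valid: 0.0 disables shadowing, 1.0 shadows all.
        if not 0.0 <= self.shadow_rate <= 1.0:
            raise ValueError("shadow_rate must be between 0.0 and 1.0")
```
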
## Privacy note

The wrapper does not store prompt or response text in the ledger by default.
Callers that need regime tracking should store non-sensitive fingerprints in
`tags`, for example `prompt_fingerprint` or `template_version`.

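
One way to produce such a fingerprint is to hash the prompt template rather than the filled-in prompt, as in the sketch below. The helper `prompt_fingerprint`, the 16-character truncation, the example template, and the `"v3"` version string are all illustrative assumptions, not part of the contract.

```python
import hashlib


def prompt_fingerprint(template: str, length: int = 16) -> str:
    """Return a short hex digest of the template text (hypothetical helper)."""
    digest = hashlib.sha256(template.encode("utf-8")).hexdigest()
    return digest[:length]


# Tags that identify the prompt regime without storing any prompt text.
tags = {
    "prompt_fingerprint": prompt_fingerprint("Summarize the following text:\n{body}"),
    "template_version": "v3",  # illustrative value
}
```

Hashing the template keeps user-supplied content out of the digest entirely. A hash of a short or guessable filled-in prompt could still be reversed by enumeration, so fingerprinting full prompts may not count as non-sensitive.
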