generated from coulomb/repo-seed
Compare commits: 0a83e908ce...c11a942bb7 (4 commits)

| Author | SHA1 | Date |
|---|---|---|
| | c11a942bb7 | |
| | 706ace3661 | |
| | a95322051f | |
| | f818acfc62 | |
20
.claude/rules/agents.md
Normal file
@@ -0,0 +1,20 @@
|
||||
## Kaizen Agents
|
||||
|
||||
Specialized agent personas available on demand via the state-hub MCP.
|
||||
|
||||
**Discover:** `list_kaizen_agents()` — returns all agents with name, description, category
|
||||
**Load:** `get_kaizen_agent("tdd-workflow")` — returns full instructions; read and follow them
|
||||
|
||||
Common agents:
|
||||
|
||||
| Agent | Category | When to use |
|-------|----------|-------------|
| `tdd-workflow` | testing | Step-by-step TDD8 workflow for any feature |
| `code-refactoring` | quality | Code quality analysis and safe refactoring |
| `test-maintenance` | testing | Diagnose and fix failing tests |
| `requirements-engineering` | process | Prevent interface/mock mismatches upfront |
| `keepaTodofile` | process | Maintain TODO.md during work |
| `project-management` | process | Track status, determine next steps |
| `datamodel-optimization` | quality | Optimize dataclasses and data structures |
|
||||
|
||||
All 17 agents: call `list_kaizen_agents()` for the full list.
|
||||
@@ -1,20 +1,8 @@
|
||||
# Architecture Notes
|
||||
## Architecture
|
||||
|
||||
The intended architecture is layered:
|
||||
<!-- TODO: Describe the key design decisions and component structure.
|
||||
Key modules, data flows, external integrations, state machines, etc. -->
|
||||
|
||||
```text
|
||||
markitect-tool -> syntax layer
|
||||
kontextual-engine -> system/runtime layer
|
||||
infospace-bench -> application layer
|
||||
```
|
||||
## Quick Reference
|
||||
|
||||
The first implementation should establish repo shape before service shape:
|
||||
|
||||
- `infospaces/` for concrete infospace projects
|
||||
- `schemas/` or dependency references for artifact schemas
|
||||
- `workflows/` for application-level workflow definitions
|
||||
- `reports/` for evaluation and inspection outputs
|
||||
- `docs/` for migration and design records
|
||||
|
||||
Use direct dependencies on lower-layer projects only where they clarify the
|
||||
boundary. Avoid copying infrastructure wholesale from `markitect-main`.
|
||||
`~/state-hub/mcp_server/TOOLS.md` — MCP tool reference
|
||||
|
||||
38
.claude/rules/first-session.md
Normal file
@@ -0,0 +1,38 @@
|
||||
## First Session Protocol
|
||||
|
||||
Triggered when `get_domain_summary("markitect")` shows **no workstreams**.
|
||||
The project is registered but work has not yet been structured.
|
||||
|
||||
**Step 1 — Read, don't write**
|
||||
- `~/the-custodian/canon/projects/markitect/project_charter_v0.1.md` — purpose, scope
|
||||
- `~/the-custodian/canon/projects/markitect/roadmap_v0.1.md` — planned phases
|
||||
- Scan repo root: README, directory structure, existing code or docs
|
||||
|
||||
**Step 2 — Survey in-progress work**
|
||||
Look for TODOs, open branches, half-finished files. Note done vs. started but incomplete.
|
||||
|
||||
**Step 3 — Propose workstreams to Bernd**
|
||||
Propose 1–3 workstreams — each a coherent strand, weeks to months, anchored to a
|
||||
roadmap phase. **Wait for approval before creating.**
|
||||
|
||||
**Step 4 — Create workplan file first, then DB record (ADR-001)**
|
||||
```
|
||||
workplans/infospace-bench-WP-NNNN-<slug>.md ← write this first
|
||||
```
|
||||
Then register in the hub:
|
||||
```
|
||||
create_workstream(topic_id="5571d954-0d30-4950-980d-7bcaaad8e3e2", title="...", owner="...", description="...")
|
||||
create_task(workstream_id="<id>", title="...", priority="high|medium|low")
|
||||
```
|
||||
|
||||
**Step 5 — Record the setup**
|
||||
```
add_progress_event(
    summary="First session: structured markitect into N workstreams, M tasks",
    event_type="milestone",
    topic_id="5571d954-0d30-4950-980d-7bcaaad8e3e2",
    detail={"workstreams": [...], "tasks_created": M}
)
```
|
||||
|
||||
<!-- Delete or archive this file once past first session -->
|
||||
@@ -1,19 +1,8 @@
|
||||
# Repo Boundary
|
||||
## Repo boundary
|
||||
|
||||
`infospace-bench` owns application-level infospace usage. It must not absorb
|
||||
lower-layer responsibilities.
|
||||
This repo owns **infospace-bench** only. It does not own:
|
||||
|
||||
Belongs here:
|
||||
|
||||
- Infospace definitions and examples
|
||||
- Application workflow definitions
|
||||
- Evaluation and inspection reports
|
||||
- Migration notes from `markitect-main`
|
||||
- Workplans for applied infospace capabilities
|
||||
|
||||
Belongs elsewhere:
|
||||
|
||||
- Markdown parsing and structural syntax primitives: `markitect-tool`
|
||||
- Runtime persistence and orchestration: `kontextual-engine`
|
||||
- LLM provider abstraction: `llm-connect` or equivalent
|
||||
- Final production domain artifacts: the relevant domain repo
|
||||
<!-- TODO: List what belongs in adjacent repos, e.g.:
|
||||
- SSH key management → railiance-infra/
|
||||
- State hub code → state-hub/
|
||||
-->
|
||||
|
||||
@@ -1,11 +1,5 @@
|
||||
# Repo Identity
|
||||
**Purpose:** Application-layer workspace and service for concrete structured knowledge spaces; scoped successor to markitect-main infospace work.
|
||||
|
||||
- Project: `infospace-bench`
|
||||
- Domain: `markitect`
|
||||
- State Hub repo slug: `infospace-bench`
|
||||
- State Hub topic ID: `5571d954-0d30-4950-980d-7bcaaad8e3e2`
|
||||
- Purpose: application-layer workspace and service for concrete infospaces.
|
||||
|
||||
This repo is a scoped successor to the application-level infospace work in
|
||||
`markitect-main`. It should preserve and extend the parts that help create,
|
||||
evaluate, inspect, and evolve real knowledge spaces.
|
||||
**Domain:** markitect
|
||||
**Repo slug:** infospace-bench
|
||||
**Topic ID:** 5571d954-0d30-4950-980d-7bcaaad8e3e2
|
||||
|
||||
@@ -1,8 +1,84 @@
|
||||
# Session Protocol
|
||||
## Session Protocol
|
||||
|
||||
1. Read `SCOPE.md`, `INTENT.md`, and the active workplan before making changes.
|
||||
2. Check `git status --short` and preserve user changes.
|
||||
3. Use State Hub as the coordination record when available.
|
||||
4. Keep repo artifacts traceable: workplans, docs, configs, metrics, and outputs
|
||||
should explain what changed and why.
|
||||
5. Prefer narrow, inspectable changes over broad platform work.
|
||||
State Hub: http://127.0.0.1:8000
|
||||
|
||||
**Step 1 — Orient**
|
||||
|
||||
Read the offline-safe brief first — it works without a live hub connection:
|
||||
```bash
|
||||
cat .custodian-brief.md
|
||||
```
|
||||
Then call the MCP tool for richer cross-domain context when MCP tools are exposed:
|
||||
```
|
||||
get_domain_summary("markitect")
|
||||
```
|
||||
If MCP tools are unavailable in the current agent session, use the REST API:
|
||||
```bash
|
||||
curl -s "http://127.0.0.1:8000/state/summary" | python3 -m json.tool
|
||||
```
|
||||
If the hub is offline: `cd ~/state-hub && make api`
|
||||
|
||||
**Step 2 — Check inbox**
|
||||
With MCP tools:
|
||||
```
|
||||
get_messages(to_agent="infospace-bench", unread_only=True)
|
||||
```
|
||||
Mark read with `mark_message_read(message_id)`. Reply or act on coordination
|
||||
requests before proceeding.
|
||||
|
||||
Without MCP tools:
|
||||
```bash
curl -s "http://127.0.0.1:8000/messages/?to_agent=infospace-bench&unread_only=true" \
  | python3 -m json.tool
curl -s -X PATCH "http://127.0.0.1:8000/messages/<id>/read" \
  -H "Content-Type: application/json" -d '{}'
```
|
||||
|
||||
**Step 3 — Scan workplans**
|
||||
```bash
|
||||
ls workplans/
|
||||
```
|
||||
For each file with `status: ready`, `active`, or `blocked`, note pending
|
||||
`todo`/`in_progress` tasks.
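
The scan in Step 3 can be sketched in Python. This is a minimal sketch, not part of the repo: it assumes every workplan opens with a `---`-delimited frontmatter block holding a plain `status: <value>` line, and the helper names `read_status` and `open_workplans` are hypothetical.

```python
from __future__ import annotations

from pathlib import Path

# Statuses that Step 3 asks the agent to look at.
OPEN_STATUSES = {"ready", "active", "blocked"}


def read_status(path: Path) -> str | None:
    """Return the frontmatter `status` value of a workplan file, if any."""
    lines = path.read_text(encoding="utf-8").splitlines()
    if not lines or lines[0].strip() != "---":
        return None  # file has no frontmatter block
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter reached without a status line
        key, sep, value = line.partition(":")
        if sep and key.strip() == "status":
            return value.strip().strip("\"'")
    return None


def open_workplans(directory: str = "workplans") -> list[tuple[str, str]]:
    """List (filename, status) for workplans that are ready, active, or blocked."""
    found = []
    for path in sorted(Path(directory).glob("*.md")):
        status = read_status(path)
        if status in OPEN_STATUSES:
            found.append((path.name, status))
    return found
```

The sketch only reads top-level `workplans/*.md`, so anything under `workplans/archived/` is ignored.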
|
||||
|
||||
**Step 4 — Present brief**
|
||||
|
||||
1. **Active workstreams** for `markitect` — title, task counts, blocking decisions
|
||||
2. **Pending tasks** from `workplans/` + any `[repo:infospace-bench]` hub tasks
|
||||
3. **Goal guidance** — if `goal_guidance` in summary:
|
||||
- `needs_workplan`: surface as top action — *"Repo goal '{title}' has no workplan yet"*
|
||||
- `alignment_warnings`: flag if active work is not aligned with current goal
|
||||
4. **Suggested next action** — highest-priority open item
|
||||
5. **SBOM status** — flag if `last_sbom_at` is unset for this repo
|
||||
|
||||
If no workstreams: follow First Session Protocol (`first-session.md`).
|
||||
|
||||
**During work:** `record_decision()` · `add_progress_event()` · `resolve_decision()`
|
||||
|
||||
> State Hub is a *read model*. Bootstrap tools (`create_workstream`, `create_task`)
|
||||
> are First Session Protocol only. Work structure belongs in repo files (ADR-001).
|
||||
|
||||
**Session close:**
|
||||
With MCP tools:
|
||||
```
|
||||
add_progress_event(summary="...", topic_id="5571d954-0d30-4950-980d-7bcaaad8e3e2", workstream_id="<uuid>")
|
||||
```
|
||||
Without MCP tools:
|
||||
```bash
curl -s -X POST http://127.0.0.1:8000/progress/ \
  -H "Content-Type: application/json" \
  -d '{"topic_id":"5571d954-0d30-4950-980d-7bcaaad8e3e2","workstream_id":"<uuid>","event_type":"note","summary":"what changed","author":"codex"}'
```
|
||||
If workplan files were modified, ensure the local copy is up to date first:
|
||||
```bash
|
||||
git -C <repo_path> pull --ff-only
|
||||
cd ~/state-hub && make fix-consistency REPO=infospace-bench
|
||||
```
|
||||
For repos where implementation runs on a remote machine (e.g. CoulombCore),
|
||||
use the combined target which pulls before fixing:
|
||||
```bash
|
||||
cd ~/state-hub && make fix-consistency-remote REPO=infospace-bench
|
||||
```
|
||||
**C-15** (DB task ahead of file) is normal in multi-machine workflows — writeback
|
||||
will sync the file to match DB. **C-16** (repo behind remote) blocks all writes
|
||||
until you pull — intentional to prevent clobbering remote progress.
|
||||
|
||||
@@ -1,26 +1,19 @@
|
||||
# Stack And Commands
|
||||
## Stack
|
||||
|
||||
The implementation stack is not established yet. Until it is, prefer
|
||||
documentation and small scaffold changes over choosing frameworks prematurely.
|
||||
<!-- TODO: Fill in language, frameworks, and key dependencies -->
|
||||
- **Language:**
|
||||
- **Key deps:**
|
||||
|
||||
The Python package depends on path deps (`markitect-tool`, `artifactstore`)
|
||||
that bring heavy runtime dependencies. Use the Makefile to provision a
|
||||
local venv before running tests:
|
||||
## Dev Commands
|
||||
|
||||
```bash
|
||||
make install # creates ./.venv with all path deps
|
||||
make test # full pytest suite (must run via .venv/bin/python)
|
||||
```
|
||||
|
||||
Useful commands:
|
||||
|
||||
```bash
|
||||
git status --short
|
||||
rg --files
|
||||
```
|
||||
|
||||
State Hub registration was completed with:
|
||||
|
||||
```bash
|
||||
/home/worsch/the-custodian/state-hub/.venv/bin/custodian register-project --domain markitect --path /home/worsch/infospace-bench
|
||||
# TODO: Fill in the standard commands for this repo
|
||||
|
||||
# Install dependencies
|
||||
|
||||
# Run tests
|
||||
|
||||
# Lint / type check
|
||||
|
||||
# Build / package (if applicable)
|
||||
```
|
||||
|
||||
@@ -1,9 +1,28 @@
|
||||
# Workplan Convention
|
||||
## Workplan Convention (ADR-001)
|
||||
|
||||
- Workplans live in `workplans/`.
|
||||
- Prefix new workplans with `IB-WP-`.
|
||||
- Use YAML frontmatter with `id`, `type`, `title`, `domain`, `repo`, `status`,
|
||||
`owner`, `topic_slug`, `created`, and `updated`.
|
||||
- Include task blocks with stable IDs, status, priority, and optional State Hub
|
||||
task IDs.
|
||||
- Keep workplans tied to this repo's PRD/FRS requirements and State Hub context.
|
||||
File location: `workplans/infospace-bench-WP-NNNN-<slug>.md`
|
||||
ID prefix: `INFOSPACE-WP`
|
||||
|
||||
Work items originate as files in this repo **before** being registered in the hub.
|
||||
|
||||
Canonical workplan/workstream frontmatter statuses are:
|
||||
`proposed`, `ready`, `active`, `blocked`, `backlog`, `finished`, `archived`.
|
||||
Use `proposed` for a newly drafted plan, `ready` after review against current
|
||||
repo state, and `finished` when implementation is complete. `stalled` and
|
||||
`needs_review` are derived health labels, not stored statuses.
|
||||
|
||||
Closed workplans may be moved to `workplans/archived/` with a completion-date
|
||||
prefix: `YYMMDD-infospace-bench-WP-NNNN-<slug>.md`. The frontmatter id remains
|
||||
unchanged; the prefix is only for quick visual reference.
|
||||
|
||||
Small opportunistic tasks discovered during another session use **Ad Hoc Tasks**:
|
||||
`workplans/ADHOC-YYYY-MM-DD.md`, workstream slug `adhoc-YYYY-MM-DD`, and task ids
|
||||
`ADHOC-YYYY-MM-DD-T01`, `T02`, etc. Use adhocs only for low-risk work completed
|
||||
directly. Promote anything requiring analysis, design, approval, dependencies, or
|
||||
multiple planned phases into a normal workplan.
|
||||
|
||||
Ecosystem todos from other agents arrive as `[repo:infospace-bench]` hub tasks —
|
||||
visible at session start. Pick one up by creating the workplan file, then registering
|
||||
the workstream.
|
||||
|
||||
<!-- Ralph Loop rules and HEUREKA sequence: ~/.claude/CLAUDE.md — do not duplicate here -->
|
||||
|
||||
162
AGENTS.md
Normal file
@@ -0,0 +1,162 @@
|
||||
# infospace-bench — Agent Instructions
|
||||
|
||||
## Repo Identity
|
||||
|
||||
**Purpose:** Application-layer workspace and service for concrete structured knowledge spaces; scoped successor to markitect-main infospace work.
|
||||
|
||||
**Domain:** markitect
|
||||
**Repo slug:** infospace-bench
|
||||
**Topic ID:** `5571d954-0d30-4950-980d-7bcaaad8e3e2`
|
||||
**Workplan prefix:** `INFOSPACE-WP-`
|
||||
|
||||
---
|
||||
|
||||
## State Hub Integration
|
||||
|
||||
The Custodian State Hub tracks work across all domains. Interact via HTTP REST —
|
||||
there is no MCP server for Codex agents.
|
||||
|
||||
| Context | URL |
|---------|-----|
| Local workstation | `http://127.0.0.1:8000` |
| Remote via tunnel | `http://127.0.0.1:18000` |
|
||||
|
||||
### Orient at session start
|
||||
|
||||
```bash
# Offline brief — works without hub connection
cat .custodian-brief.md

# Active workstreams for this domain
curl -s "http://127.0.0.1:8000/workstreams/?topic_id=5571d954-0d30-4950-980d-7bcaaad8e3e2&status=active" \
  | python3 -m json.tool

# Check inbox
curl -s "http://127.0.0.1:8000/messages/?to_agent=infospace-bench&unread_only=true" \
  | python3 -m json.tool
```
|
||||
|
||||
Mark a message read:
|
||||
```bash
curl -s -X PATCH "http://127.0.0.1:8000/messages/<id>/read" \
  -H "Content-Type: application/json" -d '{}'
```
|
||||
|
||||
### Log progress (required at session close)
|
||||
|
||||
```bash
curl -s -X POST http://127.0.0.1:8000/progress/ \
  -H "Content-Type: application/json" \
  -d '{
    "summary": "what was done",
    "event_type": "note",
    "author": "codex",
    "workstream_id": "<uuid>",
    "task_id": "<uuid>"
  }'
```
|
||||
|
||||
Omit `workstream_id` / `task_id` when not applicable.
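
The same call can be made from Python with the standard library only. This is a minimal sketch: the function names `build_progress_payload` and `post_progress` are hypothetical, and it assumes the hub answers with a JSON body, which the text above does not state.

```python
import json
import urllib.request

HUB_URL = "http://127.0.0.1:8000"  # local workstation URL from the table above


def build_progress_payload(summary, workstream_id=None, task_id=None,
                           event_type="note", author="codex"):
    """Build the POST /progress/ body, leaving out ids that do not apply."""
    payload = {"summary": summary, "event_type": event_type, "author": author}
    if workstream_id:
        payload["workstream_id"] = workstream_id
    if task_id:
        payload["task_id"] = task_id
    return payload


def post_progress(payload, base_url=HUB_URL):
    """Send the payload to the hub; assumes a JSON response body."""
    request = urllib.request.Request(
        f"{base_url}/progress/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping payload construction separate from the network call makes the omission rule easy to check without a running hub.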
|
||||
|
||||
### Update task status
|
||||
|
||||
```bash
curl -s -X PATCH "http://127.0.0.1:8000/tasks/<task_id>" \
  -H "Content-Type: application/json" \
  -d '{"status": "in_progress"}'
# values: todo | in_progress | done | blocked
```
|
||||
|
||||
### Flag a task for human review
|
||||
|
||||
```bash
curl -s -X PATCH "http://127.0.0.1:8000/tasks/<task_id>" \
  -H "Content-Type: application/json" \
  -d '{"needs_human": true, "intervention_note": "reason"}'
```
|
||||
|
||||
---
|
||||
|
||||
## Session Protocol
|
||||
|
||||
**Start:**
|
||||
1. `cat .custodian-brief.md` — domain goal and open workstreams (offline-safe)
|
||||
2. Check inbox: `GET /messages/?to_agent=infospace-bench&unread_only=true`; mark read
|
||||
3. Scan workplans: `ls workplans/` — note `status: ready`, `active`, or `blocked` files and open tasks
|
||||
4. Check blocked tasks: `GET /tasks/?needs_human=true`
|
||||
|
||||
**During work:**
|
||||
- Update task statuses in workplan files as tasks progress
|
||||
- Record significant decisions via `POST /decisions/`
|
||||
|
||||
**Close:**
|
||||
1. Update workplan file task statuses to reflect progress
|
||||
2. Log: `POST /progress/` with a summary of what changed
|
||||
3. Note for the custodian operator: after workplan file changes, run from
|
||||
`~/state-hub`:
|
||||
```bash
|
||||
make fix-consistency REPO=infospace-bench
|
||||
```
|
||||
This syncs task status from files into the hub DB.
|
||||
|
||||
---
|
||||
|
||||
## Workplan Convention (ADR-001)
|
||||
|
||||
Work items originate as files in this repo — not in the hub. The hub is a
|
||||
read/cache/index layer that rebuilds from files.
|
||||
|
||||
**File location:** `workplans/INFOSPACE-WP-NNNN-<slug>.md`
|
||||
|
||||
**Archived location:** finished workplans may move to
|
||||
`workplans/archived/YYMMDD-INFOSPACE-WP-NNNN-<slug>.md`. The `YYMMDD` prefix is
|
||||
the completion/archive date; the frontmatter `id` does not change.
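
The archive naming rule can be sketched as a small helper. The name `archived_path` is hypothetical; the sketch only computes the target path and does not move any file.

```python
from datetime import date
from pathlib import PurePosixPath


def archived_path(workplan_path: str, completed: date) -> str:
    """Return the archive location for a finished workplan.

    The original file name is kept; only a YYMMDD completion-date prefix
    is added, so the frontmatter id stays untouched.
    """
    name = PurePosixPath(workplan_path).name
    return f"workplans/archived/{completed.strftime('%y%m%d')}-{name}"
```

For example, `archived_path("workplans/INFOSPACE-WP-0001-demo.md", date(2025, 3, 7))` yields `workplans/archived/250307-INFOSPACE-WP-0001-demo.md`.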
|
||||
|
||||
**Ad Hoc Tasks:** small opportunistic fixes discovered during a session use
|
||||
`workplans/ADHOC-YYYY-MM-DD.md` with task ids `ADHOC-YYYY-MM-DD-T01`, etc. Use
|
||||
this only for low-risk work completed directly; create a normal workplan for
|
||||
anything needing analysis, design, approval, dependencies, or multiple phases.
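
As an illustration of the ad hoc naming scheme, the file name and task ids for a given day could be derived like this. The helper `adhoc_names` is hypothetical and exists only to make the pattern concrete.

```python
from datetime import date


def adhoc_names(day: date, task_count: int) -> dict:
    """Derive the ad hoc workplan file name and task ids for one day."""
    stamp = day.isoformat()  # YYYY-MM-DD
    return {
        "file": f"workplans/ADHOC-{stamp}.md",
        # Task ids are numbered from T01 with two digits.
        "task_ids": [f"ADHOC-{stamp}-T{n:02d}" for n in range(1, task_count + 1)],
    }
```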
|
||||
|
||||
**Frontmatter:**
|
||||
|
||||
```yaml
---
id: INFOSPACE-WP-NNNN
type: workplan
title: "..."
domain: markitect
repo: infospace-bench
status: proposed | ready | active | blocked | backlog | finished | archived
owner: codex
topic_slug: ...
created: "YYYY-MM-DD"
updated: "YYYY-MM-DD"
state_hub_workstream_id: "<uuid>"  # written by fix-consistency — do not edit
---
```
|
||||
|
||||
Use `proposed` for a new draft, `ready` after review against current repo
|
||||
state, and `finished` after implementation. `stalled` and `needs_review` are
|
||||
derived health labels, not frontmatter statuses.
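
The distinction between stored statuses and derived health labels can be enforced with a small check. This is a sketch under the assumption that the caller has already extracted the `status` value from the frontmatter; `check_status` is a hypothetical name.

```python
# Statuses that may be stored in workplan frontmatter.
STORED_STATUSES = (
    "proposed", "ready", "active", "blocked", "backlog", "finished", "archived",
)
# Health labels that are derived elsewhere and must never be written to a file.
DERIVED_LABELS = ("stalled", "needs_review")


def check_status(value: str) -> str:
    """Return the status unchanged if it is storable, else raise ValueError."""
    if value in STORED_STATUSES:
        return value
    if value in DERIVED_LABELS:
        raise ValueError(f"{value!r} is a derived health label, not a frontmatter status")
    raise ValueError(f"unknown workplan status: {value!r}")
```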
|
||||
|
||||
**Task block format** (one per `##` section):
|
||||
|
||||
```
## Task Title

` ` `task
id: INFOSPACE-WP-NNNN-T01
status: todo | in_progress | done | blocked
priority: high | medium | low
state_hub_task_id: "<uuid>"  # written by fix-consistency — do not edit
` ` `

Task description text.
```
|
||||
|
||||
Status progression: `todo` → `in_progress` → `done` (or `blocked`)
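
A task block in this format can be read back with a short parser. This is a minimal sketch, assuming real workplan files use an ordinary three-backtick `task` fence (the format above spaces the backticks only to keep the example readable) and simple `key: value` lines; `parse_tasks` is a hypothetical helper, not the hub's own parser.

```python
import re

FENCE = "`" * 3  # built indirectly so this example can sit inside a fenced block
TASK_BLOCK = re.compile(
    rf"^{FENCE}task\n(.*?)^{FENCE}\s*$", re.MULTILINE | re.DOTALL
)
VALID_STATUSES = {"todo", "in_progress", "done", "blocked"}


def parse_tasks(markdown: str) -> list[dict[str, str]]:
    """Extract the key/value fields of every task block in a workplan."""
    tasks = []
    for match in TASK_BLOCK.finditer(markdown):
        fields = {}
        for line in match.group(1).splitlines():
            line = line.split("#", 1)[0]  # drop trailing comments
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip().strip('"')
        if fields.get("status") not in VALID_STATUSES:
            raise ValueError(f"task {fields.get('id')!r} has an invalid status")
        tasks.append(fields)
    return tasks
```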
|
||||
|
||||
To create a new workplan:
|
||||
1. Write the file following the format above
|
||||
2. Notify the custodian operator to run `make fix-consistency REPO=infospace-bench`
|
||||
(or send a message to the hub agent via `POST /messages/`)
|
||||
@@ -1,7 +1,11 @@
|
||||
# infospace-bench — Claude Code Instructions
|
||||
|
||||
@SCOPE.md
|
||||
@.claude/rules/repo-identity.md
|
||||
@.claude/rules/session-protocol.md
|
||||
@.claude/rules/first-session.md
|
||||
@.claude/rules/workplan-convention.md
|
||||
@.claude/rules/repo-boundary.md
|
||||
@.claude/rules/architecture.md
|
||||
@.claude/rules/stack-and-commands.md
|
||||
@.claude/rules/architecture.md
|
||||
@.claude/rules/repo-boundary.md
|
||||
@.claude/rules/agents.md
|
||||
|
||||
131
docs/routing-config.md
Normal file
@@ -0,0 +1,131 @@
|
||||
# Routing Config Schema
|
||||
|
||||
Workplan: IB-WP-0020 (T01 schema, T02 loader)
|
||||
Module: `src/infospace_bench/routing_config.py`
|
||||
|
||||
A routing config is a small YAML file that names the candidate adapters
per task type and, optionally, the quality floor, the `QualityLedger`
path, and a stage-to-task-type override map. The file is the consumer
side of llm-connect `LLM-WP-0004`'s routing primitives: it embeds no
model-selection logic and only declares the universe the policy can
choose from.

`schema_version` is pinned to `1`. Bump it (and the parser) before
making backward-incompatible changes.
|
||||
|
||||
## Top-level fields
|
||||
|
||||
| Field | Type | Notes |
|---|---|---|
| `schema_version` | int (required) | Currently `1`. Mismatch fails fast. |
| `task_types` | mapping (required) | At least one entry. Each entry has `candidates` and an optional `quality_floor`. |
| `default_quality_floor` | float (optional) | Falls back when a task type does not name its own. Must be 0..1. |
| `ledger_path` | string (optional) | Path to a `QualityLedger` JSONL. Relative paths resolve against the workspace by default. Required when any `quality_floor` is non-null. |
| `stage_to_task_type` | mapping (optional) | Caller-supplied mapping from infospace-bench stage ids to task types. Falls through to identity when omitted. |
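
The two fallback rules in the table (identity mapping for stages, default floor for task types) can be sketched as follows. Both function names are hypothetical, and the sketch assumes a stage missing from a supplied mapping also falls through to identity, which the table does not spell out.

```python
def resolve_task_type(stage_id, stage_to_task_type=None):
    """Map a stage id to a task type, falling through to the stage id itself."""
    mapping = stage_to_task_type or {}
    return mapping.get(stage_id, stage_id)


def effective_quality_floor(task_type_entry, default_quality_floor=None):
    """Use the task type's own floor when set, else the top-level default."""
    floor = task_type_entry.get("quality_floor")
    # Compare against None so an explicit floor of 0.0 is respected.
    return default_quality_floor if floor is None else floor
```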
|
||||
|
||||
## Candidate fields
|
||||
|
||||
Each entry under `task_types.<task_type>.candidates[]`:
|
||||
|
||||
| Field | Type | Notes |
|---|---|---|
| `id` | string (required) | Stable adapter id used for the `QualityLedger` and the per-stage adapter-choice line of the generation report. |
| `provider` | string (required) | One of `openrouter`, `claude_code`, `openai`, `gemini`. |
| `model` | string (required) | Provider-specific model id, e.g. `openai/gpt-4o-mini`. |
| `api_key_env` | string (optional) | Env var that holds the API key. Defaults to a provider-specific name (`OPENROUTER_API_KEY` etc.) in the T02 loader. |
| `max_cost_per_1k` | float (optional) | Static cost cap. Static `RoutingPolicy` falls back to a cheaper candidate when the caller-supplied estimate exceeds this. |
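
One possible reading of the `max_cost_per_1k` fallback is sketched below. The real behaviour lives in llm-connect's `RoutingPolicy` and may differ; `pick_within_cost_cap` is a hypothetical helper that assumes candidates are tried in declaration order and that a missing cap means "no limit".

```python
def pick_within_cost_cap(candidates, estimated_cost_per_1k):
    """Return the first candidate whose cost cap admits the estimate.

    `candidates` is a list of candidate mappings as they appear under
    `task_types.<task_type>.candidates`. Returns None when every
    candidate's cap is exceeded.
    """
    for candidate in candidates:
        cap = candidate.get("max_cost_per_1k")
        if cap is None or estimated_cost_per_1k <= cap:
            return candidate
    return None
```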
|
||||
|
||||
## Example A — OpenRouter-only, two-tier
|
||||
|
||||
A pragmatic Lefevre-style config. Cheap model for summaries, mid model
|
||||
for entities/relations, cheap again for evaluation. No adaptive
|
||||
routing, no ledger.
|
||||
|
||||
```yaml
schema_version: 1

stage_to_task_type:
  summarize-source: cheap
  extract-entities: smart
  extract-relations: smart
  evaluate-entity: cheap
  synthesize-report: smart

task_types:
  cheap:
    candidates:
      - id: openrouter:gpt-4o-mini
        provider: openrouter
        model: openai/gpt-4o-mini
        api_key_env: OPENROUTER_API_KEY
  smart:
    candidates:
      - id: openrouter:claude-3.5-sonnet
        provider: openrouter
        model: anthropic/claude-3.5-sonnet
        api_key_env: OPENROUTER_API_KEY
```
|
||||
|
||||
## Example B — Adaptive with a ClaudeCode baseline
|
||||
|
||||
A two-candidate-per-stage adaptive config. The `QualityLedger`
|
||||
accumulates observations; over time, the cheaper qualifying model is
|
||||
preferred per stage. `ClaudeCodeAdapter` is wired into a separate
|
||||
`task_types.baseline` rule so it can be referenced by a
|
||||
`ShadowingAdapter` builder (T05).
|
||||
|
||||
```yaml
schema_version: 1
default_quality_floor: 0.80
ledger_path: output/routing/quality.jsonl

task_types:
  summarize-source:
    quality_floor: 0.70
    candidates:
      - id: openrouter:gpt-4o-mini
        provider: openrouter
        model: openai/gpt-4o-mini
        api_key_env: OPENROUTER_API_KEY
        max_cost_per_1k: 0.001
      - id: openrouter:claude-3.5-haiku
        provider: openrouter
        model: anthropic/claude-3.5-haiku
        api_key_env: OPENROUTER_API_KEY
        max_cost_per_1k: 0.003

  extract-entities:
    quality_floor: 0.85
    candidates:
      - id: openrouter:claude-3.5-haiku
        provider: openrouter
        model: anthropic/claude-3.5-haiku
        api_key_env: OPENROUTER_API_KEY
      - id: openrouter:claude-3.5-sonnet
        provider: openrouter
        model: anthropic/claude-3.5-sonnet
        api_key_env: OPENROUTER_API_KEY

  baseline:
    candidates:
      - id: claude-code
        provider: claude_code
        model: claude-opus-4-7
```
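
The idea that "the cheaper qualifying model is preferred" can be illustrated with a small selection function. This is a sketch only: the actual logic belongs to llm-connect's `AdaptiveRoutingPolicy` and may work differently. The helper `cheapest_qualifying` is hypothetical, uses `max_cost_per_1k` as a stand-in for observed cost, and takes mean quality per adapter id as a plain dict.

```python
def _cost_key(candidate):
    """Sort key: candidates without a cost cap are treated as most expensive."""
    cap = candidate.get("max_cost_per_1k")
    return float("inf") if cap is None else cap


def cheapest_qualifying(candidates, mean_quality, quality_floor):
    """Pick the cheapest candidate whose mean quality meets the floor.

    `mean_quality` maps adapter id -> mean observed quality; adapters with
    no observations are treated as not qualifying. Returns None when no
    candidate qualifies.
    """
    qualifying = [
        c for c in candidates
        if mean_quality.get(c["id"], 0.0) >= quality_floor
    ]
    if not qualifying:
        return None
    return min(qualifying, key=_cost_key)
```

With the `summarize-source` candidates above and a floor of 0.70, a ledger showing 0.65 for `openrouter:gpt-4o-mini` and 0.82 for `openrouter:claude-3.5-haiku` would select the haiku candidate; once the cheaper model's mean reaches the floor, it would be selected instead.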
|
||||
|
||||
## What fails fast
|
||||
|
||||
The parser refuses, before any network or workspace work, when:
|
||||
|
||||
- `schema_version` is missing or not `1`
- `task_types` is missing or empty
- Any `task_type` has no `candidates`
- A candidate is missing `id`, `provider`, or `model`
- A `provider` is not one of the supported names
- `max_cost_per_1k` is non-numeric or negative
- Any `quality_floor` (top-level or per-task) is outside 0..1
- A `task_type` has duplicate candidate `id`s
- `ledger_path` or `stage_to_task_type` has the wrong YAML shape
|
||||
|
||||
`api_key_env` resolution and live adapter construction happen in T02.
|
||||
This file only validates the declarative shape.
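
The fail-fast checks listed above can be sketched against an already-parsed mapping. This is an illustrative sketch, not the code in `routing_config.py`: the names `RoutingConfigError` and `validate_routing_config` are hypothetical, YAML loading is left to the caller, and the "ledger required when a floor is set" rule from the field table is not covered.

```python
SUPPORTED_PROVIDERS = {"openrouter", "claude_code", "openai", "gemini"}


class RoutingConfigError(ValueError):
    """Raised when a routing config has an invalid declarative shape."""


def _is_number(value):
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def _check_floor(value, where):
    if value is not None and not (_is_number(value) and 0 <= value <= 1):
        raise RoutingConfigError(f"{where}: quality_floor must be within 0..1")


def validate_routing_config(raw):
    """Validate a parsed routing config mapping; returns None on success."""
    if not isinstance(raw, dict):
        raise RoutingConfigError("config must be a mapping")
    version = raw.get("schema_version")
    if type(version) is not int or version != 1:
        raise RoutingConfigError("schema_version is missing or not 1")
    task_types = raw.get("task_types")
    if not isinstance(task_types, dict) or not task_types:
        raise RoutingConfigError("task_types is missing or empty")
    _check_floor(raw.get("default_quality_floor"), "default_quality_floor")
    ledger_path = raw.get("ledger_path")
    if ledger_path is not None and not isinstance(ledger_path, str):
        raise RoutingConfigError("ledger_path must be a string")
    mapping = raw.get("stage_to_task_type")
    if mapping is not None and not isinstance(mapping, dict):
        raise RoutingConfigError("stage_to_task_type must be a mapping")
    for name, entry in task_types.items():
        candidates = entry.get("candidates") if isinstance(entry, dict) else None
        if not isinstance(candidates, list) or not candidates:
            raise RoutingConfigError(f"task_types.{name}: no candidates")
        _check_floor(entry.get("quality_floor"), f"task_types.{name}")
        seen = set()
        for candidate in candidates:
            if not isinstance(candidate, dict):
                raise RoutingConfigError(f"task_types.{name}: candidate must be a mapping")
            for key in ("id", "provider", "model"):
                if not candidate.get(key):
                    raise RoutingConfigError(f"task_types.{name}: candidate missing {key}")
            if candidate["provider"] not in SUPPORTED_PROVIDERS:
                raise RoutingConfigError(f"unsupported provider: {candidate['provider']}")
            if candidate["id"] in seen:
                raise RoutingConfigError(f"task_types.{name}: duplicate id {candidate['id']}")
            seen.add(candidate["id"])
            cap = candidate.get("max_cost_per_1k")
            if cap is not None and not (_is_number(cap) and cap >= 0):
                raise RoutingConfigError("max_cost_per_1k must be a non-negative number")
```

Raising a single exception type keeps the failure path uniform for a CLI caller, which can report the message and exit before any network or workspace work starts.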
|
||||
@@ -256,6 +256,14 @@ def build_parser() -> argparse.ArgumentParser:
|
||||
    )
    generate_from_source.add_argument("--apply", action="store_true")

    routing = sub.add_parser("routing", help="Inspect llm-connect routing observations")
    routing_sub = routing.add_subparsers(dest="routing_command", required=True)
    routing_ledger = routing_sub.add_parser(
        "ledger",
        help="Summarise a llm-connect QualityLedger by (task_type, adapter_id)",
    )
    routing_ledger.add_argument("ledger_path")

    budget = sub.add_parser("budget", help="Inspect per-infospace budget and usage records")
    budget_sub = budget.add_subparsers(dest="budget_command", required=True)
    budget_list = budget_sub.add_parser(
|
||||
@@ -587,6 +595,17 @@ def main(argv: list[str] | None = None) -> int:
|
||||
            _write_json(plan_generation(infospace.root, stage=args.stage))
        else:
            parser.error(f"Unhandled generate command: {args.generate_command}")
    elif args.command == "routing":
        from .routing import summarise_quality_ledger
        if args.routing_command == "ledger":
            _write_json(
                {
                    "ledger_path": str(Path(args.ledger_path)),
                    "rows": summarise_quality_ledger(args.ledger_path),
                }
            )
        else:
            parser.error(f"Unhandled routing command: {args.routing_command}")
    elif args.command == "budget":
        from .budget import budget_list_workspace, budget_show
        if args.budget_command == "list":
|
||||
|
||||
@@ -791,6 +791,15 @@ def _write_generation_report(root: Path, metrics: dict[str, Any], snapshot_id: s
|
||||
            "",
        ]
    )
    if review.get("adapter_choices"):
        lines.extend(["## Per-stage adapter choices", ""])
        for row in review["adapter_choices"]:
            lines.append(
                f"- `{row['stage_id']}` ({row['task_type']}) -> "
                f"`{row['adapter_id']}` · {row['calls']} call(s) · "
                f"{row['prompt_tokens']} prompt + {row['completion_tokens']} completion tokens"
            )
        lines.append("")
    text = "\n".join(lines)
    path = root / "reports" / "generation-summary.md"
    path.parent.mkdir(parents=True, exist_ok=True)
|
||||
@@ -872,15 +881,55 @@ def _collect_review_report(root: Path) -> dict[str, Any]:
|
||||
    entity_titles = sorted(
        {item.title for item in infospace.artifacts if item.kind == "entity" and item.title}
    )
    adapter_choices = _collect_adapter_choices(generated)
    return {
        "chapter_coverage": chapter_coverage,
        "entity_titles": entity_titles,
        "unmapped_sources": unmapped,
        "page_anchor_total": len(anchors),
        "page_anchor_sample": anchors[:6],
        "adapter_choices": adapter_choices,
    }


def _collect_adapter_choices(generated: list[Any]) -> list[dict[str, Any]]:
    """Roll up which adapter ran each stage when the routing bridge was used.

    Returns one row per (stage_id, adapter_id) with call counts and
    cumulative tokens. Entries without provider_metadata are skipped so
    fixture-only runs produce an empty list rather than a noisy section.
    """
    buckets: dict[tuple[str, str], dict[str, Any]] = {}
    for item in generated:
        provenance = item.provenance or {}
        metadata = provenance.get("provider_metadata") or {}
        if not isinstance(metadata, dict):
            continue
        adapter_id = str(metadata.get("adapter_id") or metadata.get("model") or "")
        if not adapter_id:
            continue
        stage_id = str(metadata.get("stage_id") or provenance.get("stage_id") or "")
        if not stage_id:
            continue
        usage = metadata.get("usage") or {}
        key = (stage_id, adapter_id)
        bucket = buckets.setdefault(
            key,
            {
                "stage_id": stage_id,
                "adapter_id": adapter_id,
                "task_type": metadata.get("task_type") or stage_id,
                "calls": 0,
                "prompt_tokens": 0,
                "completion_tokens": 0,
            },
        )
        bucket["calls"] += 1
        bucket["prompt_tokens"] += int(usage.get("prompt_tokens") or 0)
        bucket["completion_tokens"] += int(usage.get("completion_tokens") or 0)
    return sorted(buckets.values(), key=lambda row: (row["stage_id"], row["adapter_id"]))


def _workflow_ids_for_stage(stage: str) -> list[str]:
    normalized = stage.strip().lower()
    if normalized == "intake":
|
||||
|
||||
@@ -15,8 +15,11 @@ from dataclasses import dataclass, field
|
||||
from typing import Any

from llm_connect.adapter import LLMAdapter
from llm_connect.grading import BaselineGrader
from llm_connect.models import RunConfig
from llm_connect.quality import QualityLedger
from llm_connect.routing import AdaptiveRoutingPolicy, RoutingPolicy
from llm_connect.shadowing import ShadowingAdapter

from .workflow import AssistedGenerationRequest, AssistedGenerationResult
|
||||
|
||||
@@ -116,6 +119,88 @@ def _identify_adapter(adapter: LLMAdapter) -> str:
|
||||
    return name


def wrap_with_shadow_sampling(
    *,
    candidate: LLMAdapter,
    baseline: LLMAdapter,
    grader: BaselineGrader,
    ledger: QualityLedger,
    task_type: str,
    adapter_id: str | None = None,
    baseline_adapter_id: str | None = None,
    shadow_rate: float = 0.1,
    async_shadow: bool = True,
    on_shadow_error: Any | None = None,
) -> ShadowingAdapter:
    """Wrap ``candidate`` with llm-connect's ``ShadowingAdapter``.

    Sampled baseline grading collects QualityLedger observations without
    changing the response the caller sees. Errors in the shadow path
    (baseline outage, grader failure, ledger write error) never alter the
    candidate response — failures land on ``on_shadow_error`` when
    provided, else are silently swallowed by the underlying adapter.

    The returned ``ShadowingAdapter`` is still an ``LLMAdapter``, so it
    can be slotted into a ``RoutingPolicy`` rule and used through
    ``RoutingAssistedGenerationAdapter`` without further changes.
    """
    return ShadowingAdapter(
        candidate_adapter=candidate,
        baseline_adapter=baseline,
        grader=grader,
        ledger=ledger,
        task_type=task_type,
        adapter_id=adapter_id or _identify_adapter(candidate),
        baseline_adapter_id=baseline_adapter_id or _identify_adapter(baseline),
        shadow_rate=shadow_rate,
        async_shadow=async_shadow,
        on_shadow_error=on_shadow_error,
    )


def summarise_quality_ledger(
    ledger_path: str | Any,
) -> list[dict[str, Any]]:
    """Roll up a QualityLedger into one row per (task_type, adapter_id).

    Useful as a CLI helper or a quick budget-style inspection without
    loading llm-connect's full ledger API at the call site.
    """
    from pathlib import Path

    ledger = QualityLedger(path=Path(ledger_path))
    observations = ledger.read_all()
    grouped: dict[tuple[str, str], dict[str, Any]] = {}
    for obs in observations:
        key = (obs.task_type, obs.adapter_id)
        bucket = grouped.setdefault(
            key,
            {
                "task_type": obs.task_type,
                "adapter_id": obs.adapter_id,
                "observations": 0,
                "mean_quality": 0.0,
                "mean_cost_usd": 0.0,
                "total_tokens_in": 0,
                "total_tokens_out": 0,
            },
        )
        bucket["observations"] += 1
        bucket["mean_quality"] += float(obs.quality_score)
        bucket["mean_cost_usd"] += float(obs.cost_usd)
        bucket["total_tokens_in"] += int(getattr(obs, "tokens_in", 0) or 0)
        bucket["total_tokens_out"] += int(getattr(obs, "tokens_out", 0) or 0)
    rows: list[dict[str, Any]] = []
    for bucket in grouped.values():
        count = bucket["observations"]
        if count:
            bucket["mean_quality"] = round(bucket["mean_quality"] / count, 4)
            bucket["mean_cost_usd"] = round(bucket["mean_cost_usd"] / count, 6)
        rows.append(bucket)
    rows.sort(key=lambda row: (row["task_type"], row["adapter_id"]))
    return rows


def _provider_tag(adapter: LLMAdapter) -> str:
    """Coarse provider tag matching the strings already used in run records.
|
||||
|
||||
|
||||
265
src/infospace_bench/routing_config.py
Normal file
@@ -0,0 +1,265 @@
"""
Routing config schema (IB-WP-0020-T01).

Parser-only: this module reads a YAML file into validated dataclasses.
The follow-on task T02 takes a ``RoutingConfig`` and constructs the
actual llm-connect ``RoutingPolicy`` / ``AdaptiveRoutingPolicy`` plus
LLMAdapter instances (which involves API keys and provider-specific
construction). Keeping parsing separate lets T01 stay network-free and
deterministically testable.
"""

from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import Path
from typing import Any

import yaml

from .errors import InfospaceError

ROUTING_SCHEMA_VERSION = 1

# Provider names that the T02 loader will know how to construct.
# Validation happens here so a config typo fails before any work begins.
SUPPORTED_PROVIDERS: frozenset[str] = frozenset(
    {"openrouter", "claude_code", "openai", "gemini"}
)


@dataclass(frozen=True)
class RoutingCandidateConfig:
    """One candidate adapter inside a task_type rule."""

    id: str
    provider: str
    model: str
    api_key_env: str = ""
    max_cost_per_1k: float | None = None


@dataclass(frozen=True)
class RoutingTaskTypeConfig:
    """All candidate adapters for one task_type, with an optional quality floor."""

    task_type: str
    candidates: tuple[RoutingCandidateConfig, ...]
    quality_floor: float | None = None


@dataclass(frozen=True)
class RoutingConfig:
    """Top-level routing config payload, parsed from YAML."""

    schema_version: int
    task_types: tuple[RoutingTaskTypeConfig, ...]
    default_quality_floor: float | None = None
    ledger_path: str | None = None
    stage_to_task_type: dict[str, str] = field(default_factory=dict)


def load_routing_config(path: str | Path) -> RoutingConfig:
    """Read and validate a routing config YAML file."""
    config_path = Path(path)
    if not config_path.is_file():
        raise InfospaceError(
            "missing_routing_config",
            f"Routing config does not exist: {config_path}",
            {"path": str(config_path)},
        )
    raw_text = config_path.read_text(encoding="utf-8")
    try:
        data = yaml.safe_load(raw_text)
    except yaml.YAMLError as exc:
        raise InfospaceError(
            "invalid_routing_config_yaml",
            f"Routing config is not valid YAML: {exc}",
            {"path": str(config_path)},
        ) from exc
    if not isinstance(data, dict):
        raise InfospaceError(
            "invalid_routing_config",
            "Routing config must be a YAML mapping at the top level",
            {"path": str(config_path)},
        )
    return parse_routing_config(data, source=str(config_path))


def parse_routing_config(
    data: dict[str, Any], *, source: str = "<inline>"
) -> RoutingConfig:
    """Validate a parsed routing config dict and return a frozen config."""
    schema_version = data.get("schema_version")
    if not isinstance(schema_version, int) or schema_version != ROUTING_SCHEMA_VERSION:
        raise InfospaceError(
            "unsupported_routing_schema",
            f"Routing config schema_version must be {ROUTING_SCHEMA_VERSION}",
            {"source": source, "got": schema_version},
        )
    task_types_raw = data.get("task_types") or {}
    if not isinstance(task_types_raw, dict) or not task_types_raw:
        raise InfospaceError(
            "empty_routing_task_types",
            "Routing config must declare at least one task_type with candidates",
            {"source": source},
        )

    task_types: list[RoutingTaskTypeConfig] = []
    for task_type, entry in task_types_raw.items():
        task_types.append(_parse_task_type(str(task_type), entry, source=source))

    default_floor = _optional_quality_floor(
        data.get("default_quality_floor"), "default_quality_floor", source
    )
    ledger_path_value = data.get("ledger_path")
    if ledger_path_value is not None and not isinstance(ledger_path_value, str):
        raise InfospaceError(
            "invalid_routing_ledger_path",
            "ledger_path must be a string when present",
            {"source": source},
        )

    stage_map_raw = data.get("stage_to_task_type") or {}
    if not isinstance(stage_map_raw, dict):
        raise InfospaceError(
            "invalid_routing_stage_map",
            "stage_to_task_type must be a mapping",
            {"source": source},
        )
    stage_to_task_type = {str(key): str(value) for key, value in stage_map_raw.items()}

    return RoutingConfig(
        schema_version=schema_version,
        task_types=tuple(task_types),
        default_quality_floor=default_floor,
        ledger_path=ledger_path_value if isinstance(ledger_path_value, str) else None,
        stage_to_task_type=stage_to_task_type,
    )


def _parse_task_type(
    task_type: str, entry: Any, *, source: str
) -> RoutingTaskTypeConfig:
    if not isinstance(entry, dict):
        raise InfospaceError(
            "invalid_routing_task_type",
            f"task_types.{task_type} must be a mapping",
            {"source": source, "task_type": task_type},
        )
    candidates_raw = entry.get("candidates") or []
    if not isinstance(candidates_raw, list) or not candidates_raw:
        raise InfospaceError(
            "empty_routing_candidates",
            f"task_types.{task_type} must declare at least one candidate",
            {"source": source, "task_type": task_type},
        )
    candidates: list[RoutingCandidateConfig] = []
    seen_ids: set[str] = set()
    for index, candidate_raw in enumerate(candidates_raw):
        candidate = _parse_candidate(task_type, index, candidate_raw, source=source)
        if candidate.id in seen_ids:
            raise InfospaceError(
                "duplicate_routing_candidate_id",
                f"task_types.{task_type} has duplicate candidate id {candidate.id!r}",
                {"source": source, "task_type": task_type, "id": candidate.id},
            )
        seen_ids.add(candidate.id)
        candidates.append(candidate)
    quality_floor = _optional_quality_floor(
        entry.get("quality_floor"),
        f"task_types.{task_type}.quality_floor",
        source,
    )
    return RoutingTaskTypeConfig(
        task_type=task_type,
        candidates=tuple(candidates),
        quality_floor=quality_floor,
    )


def _parse_candidate(
    task_type: str, index: int, candidate_raw: Any, *, source: str
) -> RoutingCandidateConfig:
    if not isinstance(candidate_raw, dict):
        raise InfospaceError(
            "invalid_routing_candidate",
            f"task_types.{task_type}.candidates[{index}] must be a mapping",
            {"source": source, "task_type": task_type, "index": index},
        )
    candidate_id = str(candidate_raw.get("id") or "").strip()
    provider = str(candidate_raw.get("provider") or "").strip().lower()
    model = str(candidate_raw.get("model") or "").strip()
    missing = [
        field_name
        for field_name, value in (("id", candidate_id), ("provider", provider), ("model", model))
        if not value
    ]
    if missing:
        raise InfospaceError(
            "missing_routing_candidate_field",
            f"task_types.{task_type}.candidates[{index}] is missing required fields: "
            f"{', '.join(missing)}",
            {
                "source": source,
                "task_type": task_type,
                "index": index,
                "missing": missing,
            },
        )
    if provider not in SUPPORTED_PROVIDERS:
        raise InfospaceError(
            "unsupported_routing_provider",
            f"Unsupported provider {provider!r}; allowed: {sorted(SUPPORTED_PROVIDERS)}",
            {
                "source": source,
                "task_type": task_type,
                "index": index,
                "provider": provider,
            },
        )
    max_cost = _optional_float(
        candidate_raw.get("max_cost_per_1k"),
        f"task_types.{task_type}.candidates[{index}].max_cost_per_1k",
        source,
    )
    if max_cost is not None and max_cost < 0:
        raise InfospaceError(
            "invalid_routing_max_cost",
            "max_cost_per_1k must be non-negative",
            {"source": source, "task_type": task_type, "index": index, "value": max_cost},
        )
    api_key_env = str(candidate_raw.get("api_key_env") or "").strip()
    return RoutingCandidateConfig(
        id=candidate_id,
        provider=provider,
        model=model,
        api_key_env=api_key_env,
        max_cost_per_1k=max_cost,
    )


def _optional_float(value: Any, name: str, source: str) -> float | None:
    if value is None:
        return None
    try:
        return float(value)
    except (TypeError, ValueError) as exc:
        raise InfospaceError(
            "invalid_routing_float",
            f"{name} must be numeric",
            {"source": source, "value": value},
        ) from exc


def _optional_quality_floor(value: Any, name: str, source: str) -> float | None:
    floor = _optional_float(value, name, source)
    if floor is None:
        return None
    if not 0 <= floor <= 1:
        raise InfospaceError(
            "invalid_routing_quality_floor",
            f"{name} must be between 0 and 1",
            {"source": source, "name": name, "value": floor},
        )
    return floor
@@ -213,6 +213,200 @@ def test_bridge_preserves_response_metadata_and_provider_tag() -> None:
    assert result.provider == "mock"


def test_wrap_with_shadow_sampling_passes_candidate_through(tmp_path) -> None:
    from llm_connect.grading import ExactMatchJudge, PairedGrader
    from infospace_bench.routing import wrap_with_shadow_sampling

    candidate = _MockAdapter(model="cheap-1", content="match")
    baseline = _MockAdapter(model="baseline-1", content="match")
    ledger = QualityLedger(path=tmp_path / "quality.jsonl")
    grader = PairedGrader(judge=ExactMatchJudge())

    shadow = wrap_with_shadow_sampling(
        candidate=candidate,
        baseline=baseline,
        grader=grader,
        ledger=ledger,
        task_type="extract-entities",
        shadow_rate=1.0,
        async_shadow=False,
    )

    config = RunConfig(model_name="cheap-1")
    response = shadow.execute_prompt("Hello.", config)

    assert response.content == "match"
    # Baseline ran in the shadow path; ledger now has one observation.
    assert baseline.calls, "baseline must have been called when shadow_rate=1.0"
    observations = ledger.by_task_type("extract-entities")
    assert observations, "shadow path should append at least one observation"


def test_wrap_with_shadow_sampling_isolates_baseline_failure(tmp_path) -> None:
    from llm_connect.grading import ExactMatchJudge, PairedGrader
    from infospace_bench.routing import wrap_with_shadow_sampling

    candidate = _MockAdapter(model="cheap-1", content="ok")

    class _AngryBaseline(LLMAdapter):
        def execute_prompt(self, prompt, config):
            raise RuntimeError("baseline outage")

        def validate_config(self, config):
            return True

    seen_errors: list[Exception] = []
    shadow = wrap_with_shadow_sampling(
        candidate=candidate,
        baseline=_AngryBaseline(),
        grader=PairedGrader(judge=ExactMatchJudge()),
        ledger=QualityLedger(path=tmp_path / "quality.jsonl"),
        task_type="summarize-source",
        shadow_rate=1.0,
        async_shadow=False,
        on_shadow_error=seen_errors.append,
    )
    response = shadow.execute_prompt("Hello.", RunConfig(model_name="cheap-1"))

    assert response.content == "ok", "candidate response must survive baseline outage"
    assert seen_errors and "baseline outage" in str(seen_errors[0])


def test_summarise_quality_ledger_rolls_up_by_task_and_adapter(tmp_path) -> None:
    from infospace_bench.routing import summarise_quality_ledger

    ledger_path = tmp_path / "quality.jsonl"
    ledger = QualityLedger(path=ledger_path)
    for quality in (0.9, 0.95, 0.85):
        ledger.append(
            QualityObservation(
                task_type="extract-entities",
                adapter_id="cheap-1",
                model_id="cheap-1",
                cost_usd=0.001,
                quality_score=quality,
                tokens_in=100,
                tokens_out=50,
                latency_ms=10,
            )
        )
    ledger.append(
        QualityObservation(
            task_type="summarize-source",
            adapter_id="cheaper-1",
            model_id="cheaper-1",
            cost_usd=0.0001,
            quality_score=0.7,
            tokens_in=80,
            tokens_out=20,
            latency_ms=5,
        )
    )

    rows = summarise_quality_ledger(ledger_path)

    by_key = {(row["task_type"], row["adapter_id"]): row for row in rows}
    extract = by_key[("extract-entities", "cheap-1")]
    assert extract["observations"] == 3
    assert extract["mean_quality"] == round((0.9 + 0.95 + 0.85) / 3, 4)
    assert extract["mean_cost_usd"] == 0.001
    summarize = by_key[("summarize-source", "cheaper-1")]
    assert summarize["observations"] == 1


def test_collect_adapter_choices_rolls_up_per_stage(tmp_path) -> None:
    """Unit test: report helper aggregates adapter choices from artifact provenance."""
    from infospace_bench.generator import _collect_adapter_choices

    class _FakeArtifact:
        def __init__(self, kind: str, provenance: dict) -> None:
            self.kind = kind
            self.provenance = provenance

    artifacts = [
        _FakeArtifact(
            kind="entity",
            provenance={
                "stage_id": "extract-entities",
                "provider_metadata": {
                    "adapter_id": "_MockAdapter:cheap-1",
                    "task_type": "extract-entities",
                    "usage": {"prompt_tokens": 120, "completion_tokens": 40},
                },
            },
        ),
        _FakeArtifact(
            kind="entity",
            provenance={
                "stage_id": "extract-entities",
                "provider_metadata": {
                    "adapter_id": "_MockAdapter:cheap-1",
                    "task_type": "extract-entities",
                    "usage": {"prompt_tokens": 130, "completion_tokens": 50},
                },
            },
        ),
        _FakeArtifact(
            kind="relation",
            provenance={
                "stage_id": "extract-relations",
                "provider_metadata": {
                    "adapter_id": "_MockAdapter:smart-1",
                    "task_type": "extract-relations",
                    "usage": {"prompt_tokens": 200, "completion_tokens": 80},
                },
            },
        ),
        # Artifact without provider_metadata should be ignored.
        _FakeArtifact(kind="generated", provenance={"stage_id": "summarize-source"}),
    ]

    rows = _collect_adapter_choices(artifacts)

    by_key = {(row["stage_id"], row["adapter_id"]): row for row in rows}
    entities_row = by_key[("extract-entities", "_MockAdapter:cheap-1")]
    relations_row = by_key[("extract-relations", "_MockAdapter:smart-1")]
    assert entities_row["calls"] == 2
    assert entities_row["prompt_tokens"] == 250
    assert entities_row["completion_tokens"] == 90
    assert relations_row["calls"] == 1
    assert relations_row["task_type"] == "extract-relations"


def test_routing_ledger_cli(tmp_path) -> None:
    import json as _json
    import subprocess as _sub
    import sys as _sys
    import os as _os

    ledger_path = tmp_path / "quality.jsonl"
    ledger = QualityLedger(path=ledger_path)
    ledger.append(
        QualityObservation(
            task_type="extract-entities",
            adapter_id="cheap-1",
            model_id="cheap-1",
            cost_usd=0.001,
            quality_score=0.9,
            tokens_in=100,
            tokens_out=50,
            latency_ms=10,
        )
    )

    env = _os.environ.copy()
    env["PYTHONPATH"] = "src:/home/worsch/markitect-tool/src:/home/worsch/llm-connect"
    result = _sub.run(
        [_sys.executable, "-m", "infospace_bench", "routing", "ledger", str(ledger_path)],
        check=False, env=env, text=True, capture_output=True,
    )

    assert result.returncode == 0, result.stderr
    payload = _json.loads(result.stdout)
    assert payload["ledger_path"] == str(ledger_path)
    assert payload["rows"] and payload["rows"][0]["task_type"] == "extract-entities"


def test_bridge_passes_estimated_cost_per_1k_through() -> None:
    captured: dict[str, Any] = {}

272
tests/test_routing_config.py
Normal file
@@ -0,0 +1,272 @@
"""
Tests for the routing config schema (IB-WP-0020-T01).

Parser-only — no network calls, no llm-connect construction. T02 will
test the provider construction loader separately.
"""

from __future__ import annotations

from pathlib import Path

import pytest
import yaml

from infospace_bench.errors import InfospaceError
from infospace_bench.routing_config import (
    ROUTING_SCHEMA_VERSION,
    RoutingCandidateConfig,
    RoutingConfig,
    RoutingTaskTypeConfig,
    load_routing_config,
    parse_routing_config,
)


MINIMAL = {
    "schema_version": 1,
    "task_types": {
        "summarize-source": {
            "candidates": [
                {
                    "id": "openrouter:gpt-4o-mini",
                    "provider": "openrouter",
                    "model": "openai/gpt-4o-mini",
                },
            ],
        },
    },
}


def test_parses_minimal_config() -> None:
    config = parse_routing_config(MINIMAL)

    assert config.schema_version == ROUTING_SCHEMA_VERSION
    assert config.default_quality_floor is None
    assert config.ledger_path is None
    assert config.stage_to_task_type == {}
    assert len(config.task_types) == 1
    task = config.task_types[0]
    assert task.task_type == "summarize-source"
    assert task.quality_floor is None
    assert len(task.candidates) == 1
    candidate = task.candidates[0]
    assert candidate.id == "openrouter:gpt-4o-mini"
    assert candidate.provider == "openrouter"
    assert candidate.model == "openai/gpt-4o-mini"
    assert candidate.api_key_env == ""
    assert candidate.max_cost_per_1k is None


def test_parses_full_config_round_trip() -> None:
    data = {
        "schema_version": 1,
        "default_quality_floor": 0.8,
        "ledger_path": "output/routing/quality.jsonl",
        "stage_to_task_type": {
            "extract-entities": "smart",
            "extract-relations": "smart",
        },
        "task_types": {
            "cheap": {
                "quality_floor": 0.7,
                "candidates": [
                    {
                        "id": "openrouter:gpt-4o-mini",
                        "provider": "openrouter",
                        "model": "openai/gpt-4o-mini",
                        "api_key_env": "OPENROUTER_API_KEY",
                        "max_cost_per_1k": 0.001,
                    },
                ],
            },
            "smart": {
                "quality_floor": 0.85,
                "candidates": [
                    {
                        "id": "openrouter:claude-haiku",
                        "provider": "openrouter",
                        "model": "anthropic/claude-3.5-haiku",
                    },
                    {
                        "id": "openrouter:claude-sonnet",
                        "provider": "openrouter",
                        "model": "anthropic/claude-3.5-sonnet",
                        "max_cost_per_1k": 0.003,
                    },
                ],
            },
        },
    }

    config = parse_routing_config(data)

    assert config.default_quality_floor == 0.8
    assert config.ledger_path == "output/routing/quality.jsonl"
    assert config.stage_to_task_type == {
        "extract-entities": "smart",
        "extract-relations": "smart",
    }
    smart = next(t for t in config.task_types if t.task_type == "smart")
    assert smart.quality_floor == 0.85
    assert len(smart.candidates) == 2
    assert smart.candidates[1].max_cost_per_1k == 0.003


def test_load_routing_config_reads_yaml_file(tmp_path: Path) -> None:
    config_path = tmp_path / "routing.yaml"
    config_path.write_text(yaml.safe_dump(MINIMAL, sort_keys=False), encoding="utf-8")

    config = load_routing_config(config_path)

    assert isinstance(config, RoutingConfig)
    assert config.schema_version == 1


def test_load_routing_config_missing_file(tmp_path: Path) -> None:
    with pytest.raises(InfospaceError) as exc_info:
        load_routing_config(tmp_path / "missing.yaml")
    assert exc_info.value.code == "missing_routing_config"


def test_load_routing_config_bad_yaml(tmp_path: Path) -> None:
    config_path = tmp_path / "broken.yaml"
    config_path.write_text("schema_version: 1\n bad: indent\n: : : :\n", encoding="utf-8")

    with pytest.raises(InfospaceError) as exc_info:
        load_routing_config(config_path)
    assert exc_info.value.code == "invalid_routing_config_yaml"


def test_rejects_wrong_schema_version() -> None:
    payload = {**MINIMAL, "schema_version": 2}
    with pytest.raises(InfospaceError) as exc_info:
        parse_routing_config(payload)
    assert exc_info.value.code == "unsupported_routing_schema"


def test_rejects_missing_schema_version() -> None:
    payload = {"task_types": MINIMAL["task_types"]}
    with pytest.raises(InfospaceError) as exc_info:
        parse_routing_config(payload)
    assert exc_info.value.code == "unsupported_routing_schema"


def test_rejects_empty_task_types() -> None:
    payload = {"schema_version": 1, "task_types": {}}
    with pytest.raises(InfospaceError) as exc_info:
        parse_routing_config(payload)
    assert exc_info.value.code == "empty_routing_task_types"


def test_rejects_task_type_without_candidates() -> None:
    payload = {
        "schema_version": 1,
        "task_types": {"foo": {"candidates": []}},
    }
    with pytest.raises(InfospaceError) as exc_info:
        parse_routing_config(payload)
    assert exc_info.value.code == "empty_routing_candidates"


def test_rejects_candidate_missing_required_field() -> None:
    payload = {
        "schema_version": 1,
        "task_types": {
            "foo": {
                "candidates": [{"provider": "openrouter", "model": "x"}],  # missing id
            },
        },
    }
    with pytest.raises(InfospaceError) as exc_info:
        parse_routing_config(payload)
    assert exc_info.value.code == "missing_routing_candidate_field"
    assert "id" in exc_info.value.detail["missing"]


def test_rejects_unsupported_provider() -> None:
    payload = {
        "schema_version": 1,
        "task_types": {
            "foo": {
                "candidates": [
                    {"id": "x", "provider": "acme", "model": "acme/model"},
                ],
            },
        },
    }
    with pytest.raises(InfospaceError) as exc_info:
        parse_routing_config(payload)
    assert exc_info.value.code == "unsupported_routing_provider"


def test_rejects_negative_max_cost() -> None:
    payload = {
        "schema_version": 1,
        "task_types": {
            "foo": {
                "candidates": [
                    {
                        "id": "x",
                        "provider": "openrouter",
                        "model": "openai/gpt-4o-mini",
                        "max_cost_per_1k": -1,
                    },
                ],
            },
        },
    }
    with pytest.raises(InfospaceError) as exc_info:
        parse_routing_config(payload)
    assert exc_info.value.code == "invalid_routing_max_cost"


def test_rejects_quality_floor_out_of_range() -> None:
    payload = {
        "schema_version": 1,
        "default_quality_floor": 1.5,
        "task_types": MINIMAL["task_types"],
    }
    with pytest.raises(InfospaceError) as exc_info:
        parse_routing_config(payload)
    assert exc_info.value.code == "invalid_routing_quality_floor"


def test_rejects_duplicate_candidate_ids_within_task_type() -> None:
    payload = {
        "schema_version": 1,
        "task_types": {
            "foo": {
                "candidates": [
                    {"id": "dupe", "provider": "openrouter", "model": "a"},
                    {"id": "dupe", "provider": "openrouter", "model": "b"},
                ],
            },
        },
    }
    with pytest.raises(InfospaceError) as exc_info:
        parse_routing_config(payload)
    assert exc_info.value.code == "duplicate_routing_candidate_id"


def test_rejects_non_mapping_stage_map() -> None:
    payload = {
        "schema_version": 1,
        "task_types": MINIMAL["task_types"],
        "stage_to_task_type": ["not", "a", "mapping"],
    }
    with pytest.raises(InfospaceError) as exc_info:
        parse_routing_config(payload)
    assert exc_info.value.code == "invalid_routing_stage_map"


def test_rejects_non_string_ledger_path() -> None:
    payload = {
        "schema_version": 1,
        "task_types": MINIMAL["task_types"],
        "ledger_path": 42,
    }
    with pytest.raises(InfospaceError) as exc_info:
        parse_routing_config(payload)
    assert exc_info.value.code == "invalid_routing_ledger_path"
@@ -4,11 +4,11 @@ type: workplan
title: "Adaptive LLM Routing — infospace-bench Consumer Wiring"
domain: markitect
repo: infospace-bench
status: blocked
status: done
owner: markitect
topic_slug: markitect
created: "2026-05-17"
updated: "2026-05-17"
updated: "2026-05-18"
depends_on_workplans:
  - LLM-WP-0004
related_workplans:
@@ -33,7 +33,22 @@ list will be refined once that API is stable.

## Status

Blocked on `LLM-WP-0004` T01..T03.
Done. LLM-WP-0004 landed `QualityLedger`, `QualityObservation`,
`BaselineGrader`/`PairedGrader`/`ExactMatchJudge`/`EmbeddingSimilarityJudge`/
`LLMJudge`, `AdaptiveRoutingPolicy`, and `ShadowingAdapter` in
llm-connect; the five tasks below are all complete.

- T01 — task-type taxonomy (`docs/routing-task-types.md`)
- T02 — `RoutingAssistedGenerationAdapter` bridge in
  `src/infospace_bench/routing.py`
- T03 — `wrap_with_shadow_sampling()` helper that opt-in installs
  llm-connect's `ShadowingAdapter` around any candidate
- T04 — `## Per-stage adapter choices` section in
  `reports/generation-summary.md` (driven from artifact
  `provenance.provider_metadata`) and `infospace-bench routing ledger`
  CLI subcommand
- T05 — `tests/test_routing_adapter.py` (13 tests, including a CLI
  smoke and the adapter-choices unit test)

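
As a rough illustration of the T04 rollup, the sketch below aggregates adapter choices per `(stage_id, adapter_id)` from artifact provenance. It is not the code in `infospace_bench.generator`, which is not shown in this change: the function name `collect_adapter_choices` and the `Artifact` dataclass are stand-ins, and the field names (`stage_id`, `provider_metadata`, `usage.prompt_tokens`, `usage.completion_tokens`) are taken from the adapter-choices unit test.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class Artifact:
    """Hypothetical stand-in for a generated artifact with provenance."""

    kind: str
    provenance: dict[str, Any] = field(default_factory=dict)


def collect_adapter_choices(artifacts: list[Artifact]) -> list[dict[str, Any]]:
    """Aggregate call and token counts per (stage_id, adapter_id)."""
    grouped: dict[tuple[str, str], dict[str, Any]] = {}
    for artifact in artifacts:
        metadata = artifact.provenance.get("provider_metadata")
        if not metadata:
            # Artifacts without provider metadata carry no routing decision.
            continue
        stage_id = str(artifact.provenance.get("stage_id", ""))
        adapter_id = str(metadata.get("adapter_id", ""))
        row = grouped.setdefault(
            (stage_id, adapter_id),
            {
                "stage_id": stage_id,
                "adapter_id": adapter_id,
                "task_type": metadata.get("task_type", ""),
                "calls": 0,
                "prompt_tokens": 0,
                "completion_tokens": 0,
            },
        )
        usage = metadata.get("usage") or {}
        row["calls"] += 1
        row["prompt_tokens"] += int(usage.get("prompt_tokens", 0) or 0)
        row["completion_tokens"] += int(usage.get("completion_tokens", 0) or 0)
    # Sort for a stable report order.
    return sorted(grouped.values(), key=lambda r: (r["stage_id"], r["adapter_id"]))
```

Keying on the pair rather than on the stage alone keeps the report readable when an adaptive policy picks different adapters for the same stage across calls.
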
## Why this is a separate workplan

211
workplans/IB-WP-0020-provider-routing-cli.md
Normal file
@@ -0,0 +1,211 @@
---
id: IB-WP-0020
type: workplan
title: "Provider Routing CLI Integration"
domain: markitect
repo: infospace-bench
status: active
owner: markitect
topic_slug: markitect
created: "2026-05-18"
updated: "2026-05-18"
depends_on_workplans:
  - IB-WP-0018
  - LLM-WP-0004
related_workplans:
  - IB-WP-0016
  - IB-WP-0019
state_hub_workstream_slug: "ib-wp-0020-provider-routing-cli"
state_hub_workstream_id: "172bb082-610a-477b-b5e0-26c9f4bdfd95"
---

# IB-WP-0020 — Provider Routing CLI Integration

## Goal

Expose `RoutingAssistedGenerationAdapter` (IB-WP-0018) as a first-class
CLI option so a real multi-chapter or full-book run can use the
adaptive router without writing any Python. Today `--provider` accepts
`fixture` and `openrouter`; this workplan adds `routing`, plus a small
config file that names the rules, the ledger, the quality floors, and
the per-stage task-type overrides.

The end state is a single command that does cost-aware adaptive
routing across multiple OpenRouter models and writes back the
per-stage adapter choices, the budget log, and (optionally) sampled
shadow grades:

```bash
infospace-bench generate from-source ./LEFEVRE.epub \
  --workspace ./infospaces \
  --slug reminiscences-routed \
  --name "Reminiscences (Routed)" \
  --profile trading-literature \
  --provider routing \
  --routing-config ./routing.yaml \
  --chapter I \
  --apply
```

## Why this is a separate workplan

`IB-WP-0018` shipped the bridge module and its programmatic API. CLI
wiring needs its own config-file schema, its own loader, its own error
surfaces, and its own end-to-end smoke test — and that is enough scope
to justify a separate review surface rather than absorbing it into the
already-closed IB-WP-0018.

## Non-Goals

- Owning the routing policy primitives (those live in
  `llm-connect` LLM-WP-0004).
- Replacing the static `openrouter` provider — that path stays usable
  for callers who do not want the router.
- Embedding model selection logic inside the CLI; the config file is
  declarative and routing decisions stay with `AdaptiveRoutingPolicy`.

## Tasks

### T01 — Routing config file schema

```task
id: IB-WP-0020-T01
status: done
priority: medium
state_hub_task_id: "39597441-22ab-4dcf-b68d-b045823a9374"
```

- Define a small YAML schema for a routing config:
  - `quality_floor: <float | null>` (global default)
  - `ledger_path: <str | null>` (relative to workspace by default)
  - `task_types`: map of task_type to a list of candidate adapters,
    each with `id`, `provider` (`openrouter`, `claude_code`,
    `openai`, …), `model`, `api_key_env`, optional `max_cost_per_1k`,
    optional `quality_floor` override
  - `stage_to_task_type`: optional override map
- Document the schema in `docs/routing-config.md` with two annotated
  examples (one OpenRouter-only, one ClaudeCode-as-baseline +
  OpenRouter candidates).
- Tests: schema parses; missing fields default cleanly; unknown
  providers raise a focused error.

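The schema bullets in T01 can be sketched as plain dataclasses plus a validating parser. This is only an illustration under stated assumptions: `Candidate`, `RoutingConfig`, `RoutingConfigError`, `KNOWN_PROVIDERS`, and `parse_routing_config` are hypothetical names, the allowlist contains only the providers this workplan mentions, and the input is assumed to be the dictionary a YAML parser would already have produced.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Optional

# Provider names mentioned in this workplan; the real allowlist may differ.
KNOWN_PROVIDERS = {"openrouter", "claude_code", "openai", "gemini"}


class RoutingConfigError(ValueError):
    """Hypothetical error type for an invalid routing config."""


@dataclass
class Candidate:
    id: str
    provider: str
    model: str
    api_key_env: Optional[str] = None
    max_cost_per_1k: Optional[float] = None
    quality_floor: Optional[float] = None  # per-candidate override


@dataclass
class RoutingConfig:
    quality_floor: Optional[float] = None  # global default
    ledger_path: Optional[str] = None
    task_types: dict[str, list[Candidate]] = field(default_factory=dict)
    stage_to_task_type: dict[str, str] = field(default_factory=dict)


def parse_routing_config(raw: dict[str, Any]) -> RoutingConfig:
    """Turn an already-parsed YAML mapping into a RoutingConfig."""
    task_types: dict[str, list[Candidate]] = {}
    for task_type, entries in (raw.get("task_types") or {}).items():
        candidates = []
        for entry in entries:
            provider = entry.get("provider")
            if provider not in KNOWN_PROVIDERS:
                # Focused error: say which task type and which provider.
                raise RoutingConfigError(
                    f"task_types.{task_type}: unknown provider {provider!r}"
                )
            candidates.append(
                Candidate(
                    id=entry["id"],
                    provider=provider,
                    model=entry["model"],
                    api_key_env=entry.get("api_key_env"),
                    max_cost_per_1k=entry.get("max_cost_per_1k"),
                    quality_floor=entry.get("quality_floor"),
                )
            )
        task_types[task_type] = candidates
    return RoutingConfig(
        quality_floor=raw.get("quality_floor"),
        ledger_path=raw.get("ledger_path"),
        task_types=task_types,
        stage_to_task_type=dict(raw.get("stage_to_task_type") or {}),
    )
```

Keeping the parser independent of the YAML library means the three test cases in T01 (parses, defaults cleanly, rejects unknown providers) can run against dictionary literals.
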
### T02 — Routing config loader

```task
id: IB-WP-0020-T02
status: todo
priority: high
state_hub_task_id: "5e38514b-ad6a-4d39-8716-f812f241d9fd"
```

- Add `src/infospace_bench/routing_config.py` (or extend
  `routing.py`) with `load_routing_config(path, *, workspace)` that
  returns a `RoutingPolicy` (or `AdaptiveRoutingPolicy` when the
  config sets `quality_floor` or names a ledger) ready to hand to
  `RoutingAssistedGenerationAdapter`.
- Provider construction:
  - `openrouter` → llm-connect `OpenRouterAdapter` with API key from
    `api_key_env` (default `OPENROUTER_API_KEY`)
  - `claude_code` → llm-connect `ClaudeCodeAdapter`
  - others (openai, gemini) supported but explicitly documented as
    untested for production use
- Tests: builds a static policy from a minimal config; builds an
  adaptive policy with a ledger; missing API key raises before any
  network call.

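Two of the loader rules in T02 are easy to isolate from llm-connect itself: resolving the API key before any adapter is built, and deciding whether the config calls for the adaptive policy. The sketch below assumes hypothetical helper names (`resolve_api_key`, `wants_adaptive_policy`, `MissingApiKeyError`) and assumes that a provider with no named or default environment variable simply takes no key; it does not construct any real adapter.

```python
import os
from typing import Mapping, Optional

# Default env var per provider; only the openrouter default is stated in T02.
DEFAULT_KEY_ENV = {"openrouter": "OPENROUTER_API_KEY"}


class MissingApiKeyError(RuntimeError):
    """Hypothetical error raised before any network call is attempted."""


def resolve_api_key(
    provider: str,
    api_key_env: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the key named by the config, or fail fast if it is unset."""
    var = api_key_env or DEFAULT_KEY_ENV.get(provider)
    if var is None:
        # Assumption: this provider does not read a key from the environment.
        return None
    key = env.get(var)
    if not key:
        raise MissingApiKeyError(
            f"{provider}: environment variable {var} is not set"
        )
    return key


def wants_adaptive_policy(
    quality_floor: Optional[float], ledger_path: Optional[str]
) -> bool:
    """Adaptive policy when the config sets a floor or names a ledger."""
    return quality_floor is not None or ledger_path is not None
```

Passing `env` explicitly keeps the "missing API key raises before any network call" test free of monkeypatching.
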
### T03 — `--provider routing` and `--routing-config` CLI flags

```task
id: IB-WP-0020-T03
status: todo
priority: high
state_hub_task_id: "fe5888e0-da33-413a-b026-71ed811b8c73"
```

- Add `routing` to the `--provider` choices on `generate run`,
  `generate resume`, and `generate from-source`.
- Add `--routing-config <path>` (required when `--provider routing`).
- Add `--quality-floor <float>` to override the config-level floor at
  the call site (handy for tightening or loosening for a single run
  without editing the file).
- Wire the loader into `_adapter_for`/`run_generation` so a
  `RoutingAssistedGenerationAdapter` is constructed and passed to the
  workflow engine.
- Tests: CLI smoke that builds a routing config pointing at mocked
  adapter ids and confirms the run goes through the bridge.

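The flag rules in T03 can be illustrated with a standalone `argparse` parser. Whether the real CLI uses `argparse`, and what its default provider is, are assumptions here; `build_parser`, `parse_args`, and `effective_quality_floor` are hypothetical names, and the sketch covers only the three flags this task touches.

```python
import argparse
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="generate-sketch")
    parser.add_argument(
        "--provider",
        choices=["fixture", "openrouter", "routing"],
        default="fixture",  # assumed default, for the sketch only
    )
    parser.add_argument("--routing-config", default=None)
    parser.add_argument("--quality-floor", type=float, default=None)
    return parser


def parse_args(argv: Sequence[str]) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Conditional requirement: argparse has no built-in "required if".
    if args.provider == "routing" and args.routing_config is None:
        parser.error("--routing-config is required when --provider routing")
    return args


def effective_quality_floor(
    config_floor: Optional[float], cli_floor: Optional[float]
) -> Optional[float]:
    """The call-site flag wins; otherwise fall back to the config value."""
    return cli_floor if cli_floor is not None else config_floor
```

Comparing against `None` rather than truthiness matters for the override: a caller passing `--quality-floor 0` should still replace the config value.
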
### T04 — Example config and live-smoke wiring

```task
id: IB-WP-0020-T04
status: todo
priority: medium
state_hub_task_id: "69288131-f265-4db5-a4b0-b0c8a6f55dd8"
```

- Add `examples/routing/trading-literature.yaml` with a realistic
  Lefevre-aimed config: cheap model for summaries, mid model for
  entities/relations, ClaudeCode baseline behind a shadow sampler.
- Update the optional live-OpenRouter smoke test
  (`tests/test_openrouter_live.py`) with a parallel skipped test that
  exercises `--provider routing` end-to-end when both
  `OPENROUTER_API_KEY` and
  `INFOSPACE_BENCH_ENABLE_LIVE_OPENROUTER=1` are set.
- Document how to run the live routing smoke in
  `docs/generic-source-generator.md`.

### T05 — Shadow-mode opt-in flag

```task
id: IB-WP-0020-T05
status: todo
priority: medium
state_hub_task_id: "02658420-056c-4d73-8055-e6a7ab51876b"
```

- Add `--shadow-rate <float>` and `--shadow-baseline <id>` flags so a
  caller can enable `wrap_with_shadow_sampling()` for an entire run
  without editing the config file. When set, the loader wraps each
  candidate adapter in `ShadowingAdapter` with the named baseline and
  the chosen rate.
- Tests: monkeypatched baseline asserts the shadow path fires at
  `shadow_rate=1.0` and skips at `shadow_rate=0.0`.

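The sampling behaviour that the T05 tests pin down can be shown with a stand-in wrapper. This is not the real `ShadowingAdapter` or `wrap_with_shadow_sampling()`, whose signatures are not given here; `ShadowSketch` is a hypothetical class that treats adapters as plain callables, assuming only that the baseline runs alongside the candidate on a sampled fraction of calls.

```python
import random
from typing import Callable, List, Optional, Tuple


class ShadowSketch:
    """Stand-in wrapper: always run the candidate, sometimes the baseline."""

    def __init__(
        self,
        candidate: Callable[[str], str],
        baseline: Callable[[str], str],
        shadow_rate: float,
        rng: Optional[random.Random] = None,
    ) -> None:
        if not 0.0 <= shadow_rate <= 1.0:
            raise ValueError("shadow_rate must be between 0.0 and 1.0")
        self.candidate = candidate
        self.baseline = baseline
        self.shadow_rate = shadow_rate
        self.rng = rng or random.Random()
        # (candidate output, baseline output) pairs for later grading.
        self.pairs: List[Tuple[str, str]] = []

    def __call__(self, prompt: str) -> str:
        result = self.candidate(prompt)
        # random() is in [0.0, 1.0): rate 1.0 always fires, rate 0.0 never.
        if self.rng.random() < self.shadow_rate:
            self.pairs.append((result, self.baseline(prompt)))
        # The caller only ever sees the candidate's output.
        return result
```

Using a strict `<` comparison is what makes the two boundary cases in the test plan deterministic without seeding the generator.
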
## Acceptance

- `infospace-bench generate from-source ... --provider routing
  --routing-config <path>` succeeds against the deterministic Lefevre
  fixture with a hand-crafted routing config and mocked adapters.
- The generation report's `## Per-stage adapter choices` section
  reflects the routed choices, and `output/budget/usage.yaml` buckets
  reflect the actual model that ran each call.
- The static `openrouter` and `fixture` provider paths remain
  unchanged.
- An optional live smoke test exists and is gated identically to the
  IB-WP-0016 OpenRouter smoke.
- Documentation explains the config shape, the API-key resolution, and
  the difference between adaptive routing and shadow-mode sampling.

## Risks and open questions

- **Adapter constructor surface.** llm-connect's adapter constructors
  vary slightly per provider; the loader needs to keep a small but
  explicit allowlist of provider names rather than reflective magic.
- **API key plumbing.** Today `openrouter` reads
  `OPENROUTER_API_KEY` directly. The config will name the env var
  explicitly to make multi-key setups workable; no key material
  belongs in the config file itself.
- **Schema versioning.** Include a `schema_version` field from day one
  so the loader can refuse mismatched configs once the shape
  stabilises.
- **Shadow grader choice.** v1 will default the shadow grader to
  `ExactMatchJudge` because it has no extra cost. `LLMJudge` and
  `EmbeddingSimilarityJudge` configuration belongs in a follow-up.

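The schema-versioning risk comes down to a guard the loader runs before reading anything else. A minimal sketch, assuming a hypothetical supported version of `1`, hypothetical names (`check_schema_version`, `SchemaVersionError`), and assuming that a missing `schema_version` is rejected rather than defaulted, which the workplan leaves open:

```python
from typing import Any, Mapping

SUPPORTED_SCHEMA_VERSION = 1  # hypothetical value for illustration


class SchemaVersionError(ValueError):
    """Hypothetical error for configs the loader refuses to read."""


def check_schema_version(raw: Mapping[str, Any]) -> int:
    """Refuse configs whose schema_version is absent or unsupported."""
    version = raw.get("schema_version")
    if version is None:
        # Assumption: absence is an error, not an implicit "latest".
        raise SchemaVersionError("routing config is missing schema_version")
    if version != SUPPORTED_SCHEMA_VERSION:
        raise SchemaVersionError(
            f"unsupported schema_version {version!r}; "
            f"expected {SUPPORTED_SCHEMA_VERSION}"
        )
    return version
```

Running this check first gives a mismatched config one clear error instead of a cascade of missing-field failures further into the loader.
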
## Downstream effects

- `infospace-bench routing ledger <path>` (already shipped via
  IB-WP-0018) becomes the natural companion CLI for inspecting the
  observations the routed runs accumulate.
- A successful T03 + T04 lets us run a multi-chapter Lefevre live
  build using the adaptive router and validate the IB-WP-0016
  reviewer checklist on real output without single-model lock-in.