# Performance Notes — Wealth of Nations Infospace

Observed timings, file sizes, and provider choices from the 988-entity WoN example. These are **operational notes**, not a benchmark — numbers come from the actual S3.3 evaluation run (2026-02-23) rather than a controlled experiment.

---

## Evaluation batch duration

The initial evaluation pass produced 985 `output/evaluations/*.md` files:

- First `evaluated_at`: `2026-02-23T00:11:52`
- Last `evaluated_at`: `2026-02-23T06:39:45`
- **Total wall time: ~6h 28m**
- **Effective throughput: ~2.5 entities/min** (~152 entities/hour)

Extracted from evaluation frontmatter:

```bash
grep -h '^evaluated_at:' output/evaluations/*.md | sort | sed -n '1p;$p'
```

Caveats:

- This was against OpenRouter's free tier, which applies implicit rate-limiting and occasional retries.
- Throughput is not constant — gaps between bursts show up as plateaus when you plot the timestamps.
- The batch was not fully parallelised; a tuned concurrent client on a paid OpenRouter tier could likely raise this throughput 2–4×.

---

## Tokens per entity (estimate)

Direct token counts are not logged in the evaluation files, but the inputs and outputs are on disk:

- **Input per request**: evaluation schema (~3.7 KB) + entity file (~0.7 KB median) + fixed system prompt ≈ **1500–2500 tokens in**
- **Output per request**: structured evaluation with 5 dimensions and rationales, median eval file 3.6 KB ≈ **600–800 tokens out**
- **Round-trip total**: **~2000–3000 tokens per entity**
- **Batch total estimate**: 985 entities × ~2500 tokens ≈ **~2.5M tokens** for the full pass

Because the per-entity input is constant, the cheapest way to reduce spend on a re-run is to narrow the targeted entities (`--entity ` or `--chapter `), not to shorten the schema.
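
The arithmetic behind the wall-time, throughput, and batch-token figures above can be reproduced with a short sketch. The timestamps and the 985-file count are copied from these notes; the 2500 tokens-per-entity value is an assumed midpoint of the estimated range, not a logged measurement, and the variable names are illustrative only.

```python
from datetime import datetime

# Values copied from the notes above.
first = datetime.fromisoformat("2026-02-23T00:11:52")
last = datetime.fromisoformat("2026-02-23T06:39:45")
n_evaluations = 985

# Wall time and effective throughput of the batch.
wall = last - first
minutes = wall.total_seconds() / 60
per_min = n_evaluations / minutes
per_hour = per_min * 60

# ASSUMPTION: midpoint of the ~2000-3000 tokens-per-entity estimate.
tokens_per_entity = 2500
batch_tokens = n_evaluations * tokens_per_entity

print(f"wall time:    {wall}")
print(f"throughput:   {per_min:.2f} entities/min, {per_hour:.0f} entities/hour")
print(f"batch tokens: {batch_tokens / 1e6:.2f}M")
```

Swapping `n_evaluations` for the size of a narrowed re-run gives a rough token budget for that subset, since the per-entity cost is roughly constant.
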

---

## Embedding cache and collection checks

`markitect infospace check --concern redundancy` supports two similarity backends (see `markitect/infospace/checks/redundancy.py`):

- **`word_overlap`** — the default, used when no embeddings are provided. Pure-Python set intersection over tokenised entity text. **No LLM calls, no cache needed.** This is what the current WoN check runs.
- **`embedding`** — active when a pre-computed `{slug: vector}` mapping is passed in. No persistent on-disk embedding cache exists today; the caller is responsible for computing and supplying the vectors.

Implication: the 988-entity `check` runs in seconds because it's all word-overlap. Switching to embedding similarity would add an embedding API pass (another ~988 requests) which is currently a manual step outside the CLI.

---

## Provider choice — recommendation

For the WoN dataset specifically (text-heavy entities, 5-dimension rubric):

| Scale              | Recommended provider | Rationale |
|--------------------|----------------------|-----------|
| < 50 entities      | `gemini/gemini-2.5-flash` | Fast default; free tier is generous enough; consistent with `markitect llm-check` out of the box. |
| 50 – 1000 entities | `openrouter` with a `:free` model (e.g. `arcee-ai/trinity-large-preview:free`) | What the S3.3 batch used; gets through 988 entities in one overnight run without cost. |
| > 1000 entities    | `openrouter` with a paid small-context model, or `openai` | Free-tier rate limits start to dominate wall time; paying for higher concurrency is cheaper than calendar time. |

All providers are accepted by `markitect infospace evaluate --provider`. The evaluation schema doesn't assume any provider-specific features.

Note on provider mixing: if part of a collection is evaluated under one provider/model and the rest under another, `per_entity_mean` can drift slightly (different models calibrate scores differently).
For the viability threshold of 3.5 the drift is usually negligible, but for fine-grained outlier analysis prefer a single provider per batch.

---

## What is *not* measured here

- **End-to-end pipeline time** (entity extraction from raw chapters, classification, relation graph) — only the evaluation phase is timed.
- **Memory footprint** — the full in-memory state for 988 entities is small (< 200 MB observed), but not systematically measured.
- **Failure/retry rates** — the 985 vs 988 gap is three entities the original run missed (plus one added later); no structured retry log was kept.

Expanding any of these into a proper benchmark is **out of scope** for the WoN example and should live alongside a synthetic corpus that can be regenerated deterministically.
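
Since no retry log was kept, one way to recover which entities lack an evaluation is to diff file stems between the two directories. The sketch below is a minimal illustration, not part of the `markitect` CLI: `missing_evaluations` is a hypothetical helper, and it assumes that each evaluation file shares its stem (slug) with the entity file it evaluates, which these notes do not confirm.

```python
from pathlib import Path


def missing_evaluations(entity_dir: Path, eval_dir: Path) -> list[str]:
    """Return slugs that have an entity file but no evaluation file.

    ASSUMPTION: entity and evaluation files are both `<slug>.md`.
    """
    entity_slugs = {p.stem for p in entity_dir.glob("*.md")}
    evaluated_slugs = {p.stem for p in eval_dir.glob("*.md")}
    # Set difference: entities on disk that were never evaluated.
    return sorted(entity_slugs - evaluated_slugs)


# Example call; "entities" is a placeholder path, not a documented location:
# print(missing_evaluations(Path("entities"), Path("output/evaluations")))
```

The resulting slugs could then be used to target a narrow re-run instead of repeating the full pass.
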