markitect-main/examples/infospace-with-history/docs/performance-notes.md
tegwick 36a5136bdf docs(infospace): add advanced-usage, composition guide, and performance notes (C.4/C.5/C.6)
Closes out three docs tasks from roadmap/infospace-s3-closeout/PLAN.md:

- examples/infospace-with-history/docs/advanced-usage.md (C.4) — 5 worked
  patterns covering incremental eval, re-eval workflow (no --force flag
  exists; documents the rm-then-re-run pattern instead), interpreting the
  eval-summary distribution, triaging low scorers via an awk pipeline
  over overall_score (since `entities --sort-by score` does not exist),
  and acting on check --json output.
- docs/composition-guide.md (C.5) — walks through how supply-chain-vsm
  binds WoN as a discipline, then a step-by-step for creating a new
  infospace that binds an existing one. Includes live output from
  `markitect infospace disciplines`.
- examples/infospace-with-history/docs/performance-notes.md (C.6) — cites
  the 6h 28m wall time of the 985-entity S3.3 batch, ~2.5 ent/min rate,
  ~2000–3000 tokens/entity estimate, word_overlap vs embedding backend
  for redundancy checks, and a provider-by-scale recommendation table.

All commands in these docs were run against the live infospace at
commit time.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-21 07:02:46 +02:00

# Performance Notes — Wealth of Nations Infospace
Observed timings, file sizes, and provider choices from the 988-entity WoN
example. These are **operational notes**, not a benchmark — numbers come
from the actual S3.3 evaluation run (2026-02-23) rather than a controlled
experiment.
---
## Evaluation batch duration
The initial evaluation pass produced 985 `output/evaluations/*.md` files:
- First `evaluated_at`: `2026-02-23T00:11:52`
- Last `evaluated_at`: `2026-02-23T06:39:45`
- **Total wall time: ~6h 28m**
- **Effective throughput: ~2.5 entities/min** (~152 entities/hour)
Extracted from evaluation frontmatter:
```bash
grep -h '^evaluated_at:' output/evaluations/*.md | sort | sed -n '1p;$p'
```
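The wall-time and throughput figures above follow directly from the two timestamps and the file count. A minimal Python sketch of that arithmetic (the function name `batch_throughput` is illustrative and not part of markitect):

```python
from datetime import datetime

def batch_throughput(first, last, n_entities):
    """Return (elapsed minutes, entities per minute) between two ISO timestamps."""
    start = datetime.fromisoformat(first)
    end = datetime.fromisoformat(last)
    minutes = (end - start).total_seconds() / 60
    return minutes, n_entities / minutes

minutes, per_min = batch_throughput(
    "2026-02-23T00:11:52", "2026-02-23T06:39:45", 985
)
print(f"{minutes / 60:.2f} h, {per_min:.2f} entities/min, "
      f"{per_min * 60:.0f} entities/hour")
# prints: 6.46 h, 2.54 entities/min, 152 entities/hour
```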
Caveats:
- This was against OpenRouter's free tier, which rate-limits implicitly
  and forces occasional retries.
- Throughput is not constant: gaps between bursts show up as plateaus
  when you plot the timestamps.
- The batch was not fully parallelised; a tuned concurrent client could
  likely reach 2–4× this throughput on a paid OpenRouter tier.
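The plateaus mentioned in the caveats can also be located without plotting, by looking for unusually long intervals between consecutive `evaluated_at` values. The sketch below is a hedged illustration: `find_gaps` and the 2-minute threshold are assumptions, and the sample timestamps are made up rather than taken from the S3.3 run. The output of the `grep` command above, with the `evaluated_at:` prefix stripped, could be fed in instead.

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, threshold=timedelta(minutes=2)):
    """Return (gap_start, gap_length) for consecutive timestamps
    that are further apart than `threshold`."""
    times = sorted(datetime.fromisoformat(t) for t in timestamps)
    return [(a, b - a) for a, b in zip(times, times[1:]) if b - a > threshold]

# Made-up sample data for illustration only.
sample = [
    "2026-02-23T00:11:52",
    "2026-02-23T00:12:15",
    "2026-02-23T00:12:40",
    "2026-02-23T00:19:05",
    "2026-02-23T00:19:30",
]
for start, length in find_gaps(sample):
    print(f"stall after {start.isoformat()}: {length}")
# prints: stall after 2026-02-23T00:12:40: 0:06:25
```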
---
## Tokens per entity (estimate)
Direct token counts are not logged in the evaluation files, but the
inputs and outputs are on disk:
- **Input per request**: evaluation schema (~3.7 KB) + entity file
  (~0.7 KB median) + fixed system prompt ≈ **1500–2500 tokens in**
- **Output per request**: structured evaluation with 5 dimensions and
  rationales, median eval file 3.6 KB ≈ **600–800 tokens out**
- **Round-trip total**: **~2000–3000 tokens per entity**
- **Batch total estimate**: 985 entities × ~2500 tokens ≈ **~2.5M tokens**
  for the full pass
The constant per-entity input means the cheapest way to reduce spend on a
re-run is to narrow the targeted entities (`--entity <slug>` or
`--chapter <n>`), not to shorten the schema.
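To make the batch estimate and the effect of narrowing concrete, here is a small sketch of the arithmetic. It uses the rough 2000 to 3000 tokens-per-entity range from this section; `batch_token_range` is an illustrative name, and the 40-entity re-run is a hypothetical subset size, not a measured chapter.

```python
def batch_token_range(n_entities, per_entity_low=2000, per_entity_high=3000):
    """Return (low, high) total token estimates for a batch of entities."""
    return n_entities * per_entity_low, n_entities * per_entity_high

full_low, full_high = batch_token_range(985)
print(f"full pass: {full_low:,} to {full_high:,} tokens")
# prints: full pass: 1,970,000 to 2,955,000 tokens

# Hypothetical re-run narrowed to 40 entities.
sub_low, sub_high = batch_token_range(40)
print(f"narrowed re-run: {sub_low:,} to {sub_high:,} tokens")
# prints: narrowed re-run: 80,000 to 120,000 tokens
```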
---
## Embedding cache and collection checks
`markitect infospace check --concern redundancy` supports two similarity
backends (see `markitect/infospace/checks/redundancy.py`):
- **`word_overlap`** — the default, used when no embeddings are provided.
Pure-Python set intersection over tokenised entity text. **No LLM calls,
no cache needed.** This is what the current WoN check runs.
- **`embedding`** — active when a pre-computed `{slug: vector}` mapping is
passed in. No persistent on-disk embedding cache exists today; the
caller is responsible for computing and supplying the vectors.
Implication: the 988-entity `check` runs in seconds because it's all
word-overlap. Switching to embedding similarity would add an embedding
API pass (another ~988 requests) which is currently a manual step
outside the CLI.
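The difference between the two backends can be sketched as follows. This is not the code in `redundancy.py`: the tokeniser, the Jaccard normalisation, the 0.8 threshold, and all function names are assumptions chosen for illustration. It only shows why word overlap needs no external calls, while the embedding path depends on a caller-supplied `{slug: vector}` mapping.

```python
import math
import re
from itertools import combinations

def tokenise(text):
    """Lowercased word set (assumed tokenisation, for illustration)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def word_overlap(a, b):
    """Jaccard overlap of two texts: |A & B| / |A | B|."""
    ta, tb = tokenise(a), tokenise(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def redundant_pairs(entities, embeddings=None, threshold=0.8):
    """Return (slug_a, slug_b, score) for pairs at or above `threshold`.
    Uses embeddings when supplied, otherwise falls back to word overlap."""
    pairs = []
    for a, b in combinations(sorted(entities), 2):
        if embeddings is not None:
            score = cosine(embeddings[a], embeddings[b])
        else:
            score = word_overlap(entities[a], entities[b])
        if score >= threshold:
            pairs.append((a, b, score))
    return pairs

# Made-up entity texts for illustration.
entities = {
    "division-of-labour": "division of labour increases output",
    "division-of-labour-2": "division of labour increases output greatly",
    "rent-of-land": "rent of land",
}
print([(a, b, round(s, 2)) for a, b, s in redundant_pairs(entities)])
# prints: [('division-of-labour', 'division-of-labour-2', 0.83)]
```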
---
## Provider choice — recommendation
For the WoN dataset specifically (text-heavy entities, 5-dimension
rubric):

| Scale | Recommended provider | Rationale |
|-----------------------|----------------------------------|-----------|
| < 50 entities | `gemini/gemini-2.5-flash` | Fast default; free tier is generous enough; consistent with `markitect llm-check` out of the box. |
| 50–1000 entities | `openrouter` with a `:free` model (e.g. `arcee-ai/trinity-large-preview:free`) | What the S3.3 batch used; it got through the 985-entity batch in one overnight run at no cost. |
| > 1000 entities | `openrouter` with a paid small-context model, or `openai` | Free-tier rate limits start to dominate wall time; paying for higher concurrency is cheaper than calendar time. |

All of these providers are accepted by `markitect infospace evaluate --provider`.
The evaluation schema doesn't assume any provider-specific features.
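The scale thresholds in the table can be expressed as a small helper, together with a rough wall-time projection. Both functions are illustrative sketches, not markitect features. The projection assumes the observed free-tier rate from the S3.3 batch would hold linearly, which the caveats on bursty throughput suggest is only an approximation.

```python
# Observed S3.3 rate: 985 entities in about 387.9 minutes (~2.54 per minute).
OBSERVED_RATE_PER_MIN = 985 / 387.9

def recommend_provider(n_entities):
    """Map a collection size to the provider suggested in the table."""
    if n_entities < 50:
        return "gemini/gemini-2.5-flash"
    if n_entities <= 1000:
        return "openrouter (:free model)"
    return "openrouter (paid small-context model) or openai"

def projected_hours(n_entities, rate_per_min=OBSERVED_RATE_PER_MIN):
    """Linear wall-time projection at a fixed entities-per-minute rate."""
    return n_entities / rate_per_min / 60

for n in (30, 988, 5000):
    print(f"{n} entities: {recommend_provider(n)}; "
          f"~{projected_hours(n):.1f} h at the observed free-tier rate")
```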
Note on provider mixing: if part of a collection is evaluated under one
provider/model and the rest under another, `per_entity_mean` can drift
slightly (different models calibrate scores differently). For the
viability threshold of 3.5 the drift is usually negligible, but for
fine-grained outlier analysis prefer a single provider per batch.
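One way to check whether mixing matters for a given collection is to compare mean scores per provider and list the entities that sit closer to the threshold than the observed drift. The sketch below is an assumption-laden illustration: how the provider is recorded per evaluation is not specified here, so the input is modelled as plain `(slug, provider, score)` tuples, and the sample values are made up.

```python
from collections import defaultdict
from statistics import mean

VIABILITY_THRESHOLD = 3.5  # threshold named in this section

def mean_by_provider(records):
    """Group (slug, provider, score) records and average scores per provider."""
    groups = defaultdict(list)
    for _slug, provider, score in records:
        groups[provider].append(score)
    return {provider: mean(scores) for provider, scores in groups.items()}

def provider_drift(records):
    """Largest difference between any two per-provider mean scores."""
    means = mean_by_provider(records)
    return max(means.values()) - min(means.values())

# Made-up records for illustration only.
records = [
    ("entity-a", "openrouter", 3.8),
    ("entity-b", "openrouter", 4.0),
    ("entity-c", "openrouter", 3.6),
    ("entity-d", "gemini", 3.9),
    ("entity-e", "gemini", 4.2),
    ("entity-f", "gemini", 3.9),
]
drift = provider_drift(records)
# Entities whose distance from the threshold is within the drift
# are the ones a provider switch could plausibly flip.
borderline = [slug for slug, _p, score in records
              if abs(score - VIABILITY_THRESHOLD) <= drift]
print(f"drift between providers: {drift:.2f}; borderline: {borderline}")
```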
---
## What is *not* measured here
- **End-to-end pipeline time** (entity extraction from raw chapters,
classification, relation graph) — only the evaluation phase is timed.
- **Memory footprint** — the full in-memory state for 988 entities is
small (< 200 MB observed), but not systematically measured.
- **Failure/retry rates** — the 985 vs 988 gap is three entities the
original run missed (plus one added later); no structured retry log
was kept.
Expanding any of these into a proper benchmark is **out of scope** for
the WoN example and should live alongside a synthetic corpus that can be
regenerated deterministically.
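Although no retry log exists, the entities that still lack an evaluation file can be listed from disk. The sketch below assumes that each evaluation file in `output/evaluations/` shares its filename stem with an entity file, and that entity files live in a single directory; the `entities` directory name is hypothetical.

```python
from pathlib import Path

def missing_evaluations(entities_dir, evaluations_dir):
    """Return sorted slugs that have an entity file but no evaluation file.
    Assumes both directories use `<slug>.md` filenames."""
    entity_slugs = {p.stem for p in Path(entities_dir).glob("*.md")}
    evaluated = {p.stem for p in Path(evaluations_dir).glob("*.md")}
    return sorted(entity_slugs - evaluated)

if __name__ == "__main__":
    # "entities" is a hypothetical path; adjust to the infospace layout.
    for slug in missing_evaluations("entities", "output/evaluations"):
        print(slug)
```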