docs(infospace): add advanced-usage, composition guide, and performance notes (C.4/C.5/C.6)

Closes out three docs tasks from roadmap/infospace-s3-closeout/PLAN.md:

- examples/infospace-with-history/docs/advanced-usage.md (C.4): 5 worked
  patterns covering incremental eval, re-eval workflow (no --force flag
  exists; documents the rm-then-re-run pattern instead), interpreting the
  eval-summary distribution, triaging low scorers via an awk pipeline over
  overall_score (since `entities --sort-by score` does not exist), and
  acting on check --json output.
- docs/composition-guide.md (C.5): walks through how supply-chain-vsm binds
  WoN as a discipline, then a step-by-step for creating a new infospace
  that binds an existing one. Includes live output from
  `markitect infospace disciplines`.
- examples/infospace-with-history/docs/performance-notes.md (C.6): cites
  the 6h 28m wall time of the 985-entity S3.3 batch, ~2.5 ent/min rate,
  ~2000–3000 tokens/entity estimate, word_overlap vs embedding backend for
  redundancy checks, and a provider-by-scale recommendation table.

All commands in these docs were run against the live infospace at commit time.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

examples/infospace-with-history/docs/performance-notes.md
@@ -0,0 +1,106 @@
# Performance Notes: Wealth of Nations Infospace

Observed timings, file sizes, and provider choices from the 988-entity WoN
example. These are **operational notes**, not a benchmark: the numbers come
from the actual S3.3 evaluation run (2026-02-23) rather than a controlled
experiment.

---

## Evaluation batch duration

The initial evaluation pass produced 985 `output/evaluations/*.md` files:

- First `evaluated_at`: `2026-02-23T00:11:52`
- Last `evaluated_at`: `2026-02-23T06:39:45`
- **Total wall time: ~6h 28m**
- **Effective throughput: ~2.5 entities/min** (~152 entities/hour)

Extracted from evaluation frontmatter:

```bash
grep -h '^evaluated_at:' output/evaluations/*.md | sort | sed -n '1p;$p'
```
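
As a cross-check on the wall-time and throughput bullets, the arithmetic can be reproduced from the two timestamps that the `grep` pipeline prints. This is a minimal Python sketch, assuming the timestamps are naive ISO-8601 strings exactly as shown above and taking the file count (985) from this page. The helper name `summarise_batch` is illustrative and not part of markitect.

```python
from datetime import datetime, timedelta


def summarise_batch(first: str, last: str, n_files: int) -> tuple[timedelta, float]:
    """Return (elapsed wall time, entities per minute) for one batch."""
    elapsed = datetime.fromisoformat(last) - datetime.fromisoformat(first)
    per_minute = n_files / (elapsed.total_seconds() / 60)
    return elapsed, per_minute


# Values copied from the bullets above.
elapsed, per_minute = summarise_batch(
    "2026-02-23T00:11:52", "2026-02-23T06:39:45", 985
)
print(elapsed)                  # 6:27:53
print(round(per_minute, 2))     # 2.54
print(round(per_minute * 60))   # 152
```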

Caveats:

- This was against OpenRouter's free tier, which applies implicit
  rate-limiting and occasional retries.
- Throughput is not constant: gaps between bursts show up as plateaus
  when you plot the timestamps.
- The batch was not fully parallelised; a tuned concurrent client could
  likely reach 2–4× this throughput on a paid OpenRouter tier.
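
One way to locate the plateaus mentioned in the caveats without plotting is to list the largest gaps between consecutive `evaluated_at` values. The sketch below is an illustration under stated assumptions: it expects already-sorted ISO-8601 timestamp strings (for example, the full output of the `grep ... | sort` stage above), the function name `largest_gaps` is hypothetical, and the sample timestamps are synthetic rather than taken from the S3.3 run.

```python
from datetime import datetime, timedelta


def largest_gaps(timestamps: list[str], top: int = 3) -> list[tuple[timedelta, str, str]]:
    """Return the `top` longest pauses as (gap, before, after), longest first."""
    parsed = [datetime.fromisoformat(t) for t in timestamps]
    gaps = [
        (later - earlier, timestamps[i], timestamps[i + 1])
        for i, (earlier, later) in enumerate(zip(parsed, parsed[1:]))
    ]
    return sorted(gaps, key=lambda g: g[0], reverse=True)[:top]


# Synthetic example: a steady ~20 s cadence with one 10-minute stall.
sample = [
    "2026-01-01T00:00:00",
    "2026-01-01T00:00:20",
    "2026-01-01T00:00:41",
    "2026-01-01T00:10:41",
    "2026-01-01T00:11:02",
]
for gap, before, after in largest_gaps(sample, top=1):
    print(f"{gap} between {before} and {after}")
```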

---

## Tokens per entity (estimate)

Direct token counts are not logged in the evaluation files, but the
inputs and outputs are on disk:

- **Input per request**: evaluation schema (~3.7 KB) + entity file
  (~0.7 KB median) + fixed system prompt ≈ **~1500–2500 tokens in**
- **Output per request**: structured evaluation with 5 dimensions and
  rationales, median eval file 3.6 KB ≈ **~600–800 tokens out**
- **Round-trip total**: **~2000–3000 tokens per entity**
- **Batch total estimate**: 985 entities × ~2500 tokens ≈ **~2.5M tokens**
  for the full pass
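
The batch estimate is a single multiplication, which makes it easy to bracket. This small sketch uses only the per-entity round-trip figures quoted above. Nothing in it is measured, and the variable names are illustrative.

```python
ENTITIES = 985  # evaluation files from the initial pass
PER_ENTITY_TOKENS = {"low": 2000, "mid": 2500, "high": 3000}  # round-trip estimates

# Total tokens for the full pass under each per-entity assumption.
batch_tokens = {label: ENTITIES * tokens for label, tokens in PER_ENTITY_TOKENS.items()}

for label, total in batch_tokens.items():
    print(f"{label:>4}: {total:>9,} tokens (~{total / 1e6:.1f}M)")
```

The mid-point reproduces the ~2.5M figure. The low and high cases bracket it at roughly 2.0M and 3.0M tokens.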

The constant per-entity input means the cheapest way to reduce spend on a
re-run is to narrow the targeted entities (`--entity <slug>` or
`--chapter <n>`), not to shorten the schema.

---

## Embedding cache and collection checks

`markitect infospace check --concern redundancy` supports two similarity
backends (see `markitect/infospace/checks/redundancy.py`):

- **`word_overlap`**: the default, used when no embeddings are provided.
  Pure-Python set intersection over tokenised entity text. **No LLM calls,
  no cache needed.** This is what the current WoN check runs.
- **`embedding`**: active when a pre-computed `{slug: vector}` mapping is
  passed in. No persistent on-disk embedding cache exists today; the
  caller is responsible for computing and supplying the vectors.
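
To make the `word_overlap` idea concrete, here is a minimal Python sketch of a set-intersection similarity. It is not the code in `redundancy.py`: the tokeniser (lower-cased alphanumeric runs), the Jaccard-style normalisation, and the function names are all assumptions chosen for illustration.

```python
import re


def tokenise(text: str) -> set[str]:
    """Lower-case the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def word_overlap(a: str, b: str) -> float:
    """Jaccard similarity: shared tokens divided by all distinct tokens."""
    tokens_a, tokens_b = tokenise(a), tokenise(b)
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


print(word_overlap("division of labour", "the division of labour"))  # 0.75
print(word_overlap("division of labour", "invisible hand"))          # 0.0
```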

Implication: the 988-entity `check` runs in seconds because it's all
word-overlap. Switching to embedding similarity would add an embedding
API pass (another ~988 requests), which is currently a manual step
outside the CLI.
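
For comparison, an embedding-based check reduces to cosine similarity over the caller-supplied `{slug: vector}` mapping. The sketch below is plain Python with made-up three-dimensional vectors and hypothetical slugs. How `redundancy.py` actually consumes the mapping, and which embedding API produces the vectors, is outside what this page documents.

```python
import math
from itertools import combinations


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def similar_pairs(vectors: dict[str, list[float]], threshold: float) -> list[tuple[str, str, float]]:
    """Return slug pairs whose cosine similarity is at or above `threshold`."""
    pairs = []
    for (slug_a, vec_a), (slug_b, vec_b) in combinations(sorted(vectors.items()), 2):
        score = cosine(vec_a, vec_b)
        if score >= threshold:
            pairs.append((slug_a, slug_b, score))
    return pairs


# Toy vectors; real embeddings have far more dimensions.
toy = {
    "entity-a": [1.0, 0.0, 0.0],
    "entity-b": [0.9, 0.1, 0.0],
    "entity-c": [0.0, 0.0, 1.0],
}
print([(a, b) for a, b, _ in similar_pairs(toy, threshold=0.9)])
```

A pairwise scan like this over 988 entities is 487,578 comparisons, which is cheap to run locally. The API pass that produces the vectors is the part that adds requests.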

---

## Provider choice: recommendation

For the WoN dataset specifically (text-heavy entities, 5-dimension
rubric):

| Scale | Recommended provider | Rationale |
|-----------------------|----------------------------------|-----------|
| < 50 entities | `gemini/gemini-2.5-flash` | Fast default; free tier is generous enough; consistent with `markitect llm-check` out of the box. |
| 50 – 1000 entities | `openrouter` with a `:free` model (e.g. `arcee-ai/trinity-large-preview:free`) | What the S3.3 batch used; it evaluated 985 entities in one overnight run (~6h 28m) at no cost. |
| > 1000 entities | `openrouter` with a paid small-context model, or `openai` | Free-tier rate limits start to dominate wall time; paying for higher concurrency is cheaper than calendar time. |

All providers are accepted by `markitect infospace evaluate --provider`.
The evaluation schema doesn't assume any provider-specific features.
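
The table can be read as a simple threshold lookup. The sketch below encodes it as a Python function, assuming the boundaries are exactly the ones in the table (with 1000 itself falling in the middle tier). The function name is hypothetical, and the returned strings are descriptive labels rather than values known to be accepted verbatim by `--provider`.

```python
def recommend_provider(entity_count: int) -> str:
    """Map a collection size to the provider tier suggested in the table above."""
    if entity_count < 50:
        return "gemini/gemini-2.5-flash"
    if entity_count <= 1000:
        return "openrouter (:free model)"
    return "openrouter (paid) or openai"


for n in (20, 988, 5000):
    print(n, "->", recommend_provider(n))
```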

Note on provider mixing: if part of a collection is evaluated under one
provider/model and the rest under another, `per_entity_mean` can drift
slightly (different models calibrate scores differently). For the
viability threshold of 3.5 the drift is usually negligible, but for
fine-grained outlier analysis prefer a single provider per batch.
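
A quick way to see whether mixing matters for a given collection is to compare mean scores grouped by model. The sketch below works on already-parsed records. The `model` key is an assumed field name, `overall_score` is borrowed from the score field mentioned in the commit message, and the records are synthetic, so the drift it prints says nothing about real models.

```python
from collections import defaultdict
from statistics import mean


def mean_by_model(evaluations: list[dict]) -> dict[str, float]:
    """Group overall scores by model and return each group's mean."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for ev in evaluations:
        grouped[ev["model"]].append(ev["overall_score"])
    return {model: mean(scores) for model, scores in grouped.items()}


# Synthetic records; the keys "model" and "overall_score" are assumed names.
records = [
    {"model": "model-a", "overall_score": 4.0},
    {"model": "model-a", "overall_score": 3.5},
    {"model": "model-b", "overall_score": 3.0},
    {"model": "model-b", "overall_score": 3.5},
]
means = mean_by_model(records)
print(means)
print("drift:", max(means.values()) - min(means.values()))
```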

---

## What is *not* measured here

- **End-to-end pipeline time** (entity extraction from raw chapters,
  classification, relation graph): only the evaluation phase is timed.
- **Memory footprint**: the full in-memory state for 988 entities is
  small (< 200 MB observed), but not systematically measured.
- **Failure/retry rates**: the 985 vs 988 gap is three entities the
  original run missed (plus one added later); no structured retry log
  was kept.

Expanding any of these into a proper benchmark is **out of scope** for
the WoN example and should live alongside a synthetic corpus that can be
regenerated deterministically.