markitect-main/examples/infospace-with-history/docs/performance-notes.md
tegwick 36a5136bdf docs(infospace): add advanced-usage, composition guide, and performance notes (C.4/C.5/C.6)
Closes out three docs tasks from roadmap/infospace-s3-closeout/PLAN.md:

- examples/infospace-with-history/docs/advanced-usage.md (C.4) — 5 worked
  patterns covering incremental eval, re-eval workflow (no --force flag
  exists; documents the rm-then-re-run pattern instead), interpreting the
  eval-summary distribution, triaging low scorers via an awk pipeline
  over overall_score (since `entities --sort-by score` does not exist),
  and acting on check --json output.
- docs/composition-guide.md (C.5) — walks through how supply-chain-vsm
  binds WoN as a discipline, then a step-by-step for creating a new
  infospace that binds an existing one. Includes live output from
  `markitect infospace disciplines`.
- examples/infospace-with-history/docs/performance-notes.md (C.6) — cites
  the 6h 28m wall time of the 985-entity S3.3 batch, ~2.5 ent/min rate,
  ~2000–3000 tokens/entity estimate, word_overlap vs embedding backend
  for redundancy checks, and a provider-by-scale recommendation table.

All commands in these docs were run against the live infospace at
commit time.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-21 07:02:46 +02:00


Performance Notes — Wealth of Nations Infospace

Observed timings, file sizes, and provider choices from the 988-entity WoN example. These are operational notes, not a benchmark — numbers come from the actual S3.3 evaluation run (2026-02-23) rather than a controlled experiment.


Evaluation batch duration

The initial evaluation pass produced 985 output/evaluations/*.md files:

  • First evaluated_at: 2026-02-23T00:11:52
  • Last evaluated_at: 2026-02-23T06:39:45
  • Total wall time: ~6h 28m
  • Effective throughput: ~2.5 entities/min (~152 entities/hour)

Extracted from evaluation frontmatter:

grep -h '^evaluated_at:' output/evaluations/*.md | sort | sed -n '1p;$p'
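The wall time and throughput figures above follow directly from the two timestamps and the file count. A minimal sketch of that arithmetic, assuming the `evaluated_at` values are naive ISO-8601 strings as shown (the helper name `batch_stats` is illustrative and not part of markitect):

```python
from datetime import datetime

def batch_stats(first: str, last: str, n_entities: int) -> dict:
    """Wall time and throughput from the first/last evaluated_at values."""
    start = datetime.fromisoformat(first)
    end = datetime.fromisoformat(last)
    seconds = (end - start).total_seconds()
    return {
        "wall_seconds": seconds,
        "per_minute": n_entities / (seconds / 60),
        "per_hour": n_entities / (seconds / 3600),
    }

# Values taken from the S3.3 run described above.
stats = batch_stats("2026-02-23T00:11:52", "2026-02-23T06:39:45", 985)
hours, rem = divmod(int(stats["wall_seconds"]), 3600)
print(f"{hours}h {rem // 60}m, "
      f"{stats['per_minute']:.1f}/min, {stats['per_hour']:.0f}/hour")
# prints: 6h 27m, 2.5/min, 152/hour
```

The exact span is 6h 27m 53s, which the notes round to ~6h 28m.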

Caveats:

  • This was against OpenRouter's free tier, which applies implicit rate-limiting and occasional retries.
  • Throughput is not constant — gaps between bursts show up as plateaus when you plot the timestamps.
  • The batch was not fully parallelised; a tuned concurrent client could likely reach 2–4× this throughput on a paid OpenRouter tier.
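The plateaus mentioned in the caveats can also be found without plotting, by looking for unusually long gaps between consecutive `evaluated_at` values. A rough sketch, assuming the timestamps have already been extracted into a list; the 5-minute threshold and the sample timestamps are made up for illustration and are not from the S3.3 run:

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, threshold=timedelta(minutes=5)):
    """Return (gap_start, gap_length) for consecutive timestamps
    that are further apart than `threshold`."""
    ordered = sorted(datetime.fromisoformat(t) for t in timestamps)
    return [
        (earlier, later - earlier)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > threshold
    ]

# Hypothetical sample: two bursts separated by one long pause.
sample = [
    "2026-01-01T00:00:00",
    "2026-01-01T00:00:25",
    "2026-01-01T00:00:50",
    "2026-01-01T00:19:10",
    "2026-01-01T00:19:35",
]
gaps = find_gaps(sample)
for start, length in gaps:
    print(f"pause of {length} starting at {start.isoformat()}")
```

Summing the gap lengths gives a lower bound on how much of the wall time was spent waiting rather than evaluating.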

Tokens per entity (estimate)

Direct token counts are not logged in the evaluation files, but the inputs and outputs are on disk:

  • Input per request: evaluation schema (~3.7 KB) + entity file (~0.7 KB median) + fixed system prompt ≈ ~1500–2500 tokens in
  • Output per request: structured evaluation with 5 dimensions and rationales, median eval file 3.6 KB ≈ ~600–800 tokens out
  • Round-trip total: ~2000–3000 tokens per entity
  • Batch total estimate: 985 entities × ~2500 tokens ≈ ~2.5M tokens for the full pass
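The batch estimate is simple multiplication, and the same arithmetic gives a quick budget for a partial re-run. A small sketch using only the ranges quoted above; the 50-entity re-run is a hypothetical example, and the function name is illustrative:

```python
# Per-entity estimates quoted in the notes (not measured token counts).
TOKENS_IN = (1500, 2500)
TOKENS_OUT = (600, 800)
MIDPOINT = 2500  # the per-entity figure used for the batch estimate

def batch_tokens(n_entities: int, tokens_per_entity: int) -> int:
    """Total tokens for a pass over n_entities at a flat per-entity cost."""
    return n_entities * tokens_per_entity

low = TOKENS_IN[0] + TOKENS_OUT[0]    # 2100
high = TOKENS_IN[1] + TOKENS_OUT[1]   # 3300

print(batch_tokens(985, low), batch_tokens(985, high))  # 2068500 3250500
print(batch_tokens(985, MIDPOINT))                       # 2462500
print(batch_tokens(50, MIDPOINT))                        # 125000
```

Adding the input and output ranges gives 2100–3300 tokens, slightly above the rounded ~2000–3000 figure, so the ~2.5M total is best read as an order-of-magnitude estimate.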

The constant per-entity input means the cheapest way to reduce spend on a re-run is to narrow the targeted entities (--entity <slug> or --chapter <n>), not to shorten the schema.


Embedding cache and collection checks

markitect infospace check --concern redundancy supports two similarity backends (see markitect/infospace/checks/redundancy.py):

  • word_overlap — the default, used when no embeddings are provided. Pure-Python set intersection over tokenised entity text. No LLM calls, no cache needed. This is what the current WoN check runs.
  • embedding — active when a pre-computed {slug: vector} mapping is passed in. No persistent on-disk embedding cache exists today; the caller is responsible for computing and supplying the vectors.
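To make the word_overlap idea concrete, here is a minimal sketch of a set-intersection similarity. It is an illustration only: the tokeniser and the Jaccard normalisation are assumptions, and the actual logic in markitect/infospace/checks/redundancy.py may differ.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def word_overlap(a: str, b: str) -> float:
    """Jaccard similarity: shared tokens divided by all distinct tokens."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

print(word_overlap("division of labour", "the division of labour"))  # 0.75
```

Because this needs only string operations, it requires no network calls and no cache, which matches the behaviour described for the default backend.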

Implication: the 988-entity check runs in seconds because it's all word-overlap. Switching to embedding similarity would add an embedding API pass (another ~988 requests) which is currently a manual step outside the CLI.
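For comparison, an embedding-based check would score pairs from the supplied `{slug: vector}` mapping. The sketch below assumes cosine similarity and a 0.9 threshold, neither of which is confirmed by the source; the toy vectors and slugs are made up, and computing real vectors is the manual step described above.

```python
import math
from itertools import combinations

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def similar_pairs(vectors: dict[str, list[float]], threshold: float = 0.9):
    """Return (slug_a, slug_b, score) for every pair at or above threshold."""
    pairs = []
    for a, b in combinations(sorted(vectors), 2):
        score = cosine(vectors[a], vectors[b])
        if score >= threshold:
            pairs.append((a, b, score))
    return pairs

# Toy 2-dimensional vectors with hypothetical slugs.
toy = {"labour": [1.0, 0.0], "labor": [1.0, 0.1], "rent": [0.0, 1.0]}
pairs = similar_pairs(toy)
print([(a, b) for a, b, _ in pairs])  # [('labor', 'labour')]
```

An all-pairs comparison over 988 entities is 487,578 pairs, which is cheap locally; the cost of the embedding backend is in producing the vectors, not in comparing them.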


Provider choice — recommendation

For the WoN dataset specifically (text-heavy entities, 5-dimension rubric):

| Scale | Recommended provider | Rationale |
|---|---|---|
| < 50 entities | gemini/gemini-2.5-flash | Fast default; free tier is generous enough; consistent with markitect llm-check out of the box. |
| 50–1000 entities | openrouter with a :free model (e.g. arcee-ai/trinity-large-preview:free) | What the S3.3 batch used; gets through 988 entities in one overnight run without cost. |
| > 1000 entities | openrouter with a paid small-context model, or openai | Free-tier rate limits start to dominate wall time; paying for higher concurrency is cheaper than calendar time. |

All providers are accepted by markitect infospace evaluate --provider. The evaluation schema doesn't assume any provider-specific features.
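If the choice needs to be scripted, for example in a wrapper that picks a provider before calling the CLI, the table reduces to a threshold lookup. This is a hypothetical helper, not something markitect ships, and the returned strings are descriptive labels rather than exact `--provider` values:

```python
def recommend_provider(n_entities: int) -> str:
    """Map a collection size to the recommendation from the table above."""
    if n_entities < 50:
        return "gemini/gemini-2.5-flash"
    if n_entities <= 1000:
        return "openrouter with a :free model"
    return "openrouter with a paid small-context model, or openai"

print(recommend_provider(988))  # openrouter with a :free model
```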

Note on provider mixing: if part of a collection is evaluated under one provider/model and the rest under another, per_entity_mean can drift slightly (different models calibrate scores differently). For the viability threshold of 3.5 the drift is usually negligible, but for fine-grained outlier analysis prefer a single provider per batch.
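One way to check whether mixing matters for a given collection is to compare mean scores per provider. A rough sketch, assuming each evaluation can be reduced to a `(provider, score)` pair; how the provider is recorded per evaluation is not specified in these notes, and the sample scores are invented:

```python
from collections import defaultdict
from statistics import mean

def mean_by_provider(rows):
    """Group (provider, score) pairs and return the mean score per provider."""
    groups = defaultdict(list)
    for provider, score in rows:
        groups[provider].append(score)
    return {provider: mean(scores) for provider, scores in groups.items()}

# Invented scores for two hypothetical provider labels.
rows = [
    ("provider-a", 3.8), ("provider-a", 4.0),
    ("provider-b", 3.6), ("provider-b", 3.4),
]
means = mean_by_provider(rows)
drift = max(means.values()) - min(means.values())
print({p: round(m, 2) for p, m in means.items()}, round(drift, 2))
```

A drift that is small relative to the distance between typical scores and the 3.5 threshold supports treating the mixed batch as one; a larger drift is a reason to re-evaluate under a single provider.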


What is not measured here

  • End-to-end pipeline time (entity extraction from raw chapters, classification, relation graph) — only the evaluation phase is timed.
  • Memory footprint — the full in-memory state for 988 entities is small (< 200 MB observed), but not systematically measured.
  • Failure/retry rates — the 985 vs 988 gap is three entities the original run missed (plus one added later); no structured retry log was kept.
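In the absence of a retry log, the missed entities can still be recovered after the fact by comparing file names. A minimal sketch, assuming that entity files and evaluation files share the same filename stem and that entities live in their own directory of `.md` files; both are assumptions, and only `output/evaluations` is named in these notes:

```python
from pathlib import Path

def missing_evaluations(entity_dir: Path, eval_dir: Path) -> list[str]:
    """Slugs that have an entity file but no matching evaluation file."""
    entities = {p.stem for p in entity_dir.glob("*.md")}
    evaluated = {p.stem for p in eval_dir.glob("*.md")}
    return sorted(entities - evaluated)

# Example call; the entity directory path is hypothetical:
# missing_evaluations(Path("output/entities"), Path("output/evaluations"))
```

The resulting slugs can then be passed one at a time to a targeted re-run with `--entity <slug>`.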

Expanding any of these into a proper benchmark is out of scope for the WoN example and should live alongside a synthetic corpus that can be regenerated deterministically.