coulomb/markitect-main

tegwick e3e5b8ecc1

feat(infospace): systematic long-text processing — rich commit bodies, per-source eval/classify, chapters view

Three coordinated changes that let the pipeline produce a clean
chapter-by-chapter git history on long texts without archaeology after
the fact.

1. Richer commit messages. `SourcePipeline._git_commit` now diffs the
   staged changes, buckets added files by output subdirectory (entities,
   evaluations, classifications, mappings, analyses, metrics, logs), and
   includes counts in the commit body. So `git log` reads "entities:
   +23, evaluations: +23" per chapter instead of the same generic blurb
   on every commit. Zero behaviour change when no output changed; falls
   back to the original message if the diff query fails.

2. --eval-after-source / --classify-after-source on `infospace process`.
   After a source's stages succeed, the pipeline identifies which entity
   files are *new* (set diff of entity slugs before vs after), loads
   their EntityMeta, and runs per-entity evaluation and/or
   classification scoped to just those slugs before the per-source git
   commit lands. Result: each chapter's commit is self-contained —
   extraction + evaluation + classification in one atomic unit. Gated
   behind explicit flags because the cost is real (LLM latency per
   chapter rather than amortised across one bulk batch).

3. `markitect infospace chapters` subcommand. Lists source files in
   canonical order with entity count, evaluated count, classified
   count, and mean per-entity score per source. Text or JSON output.
   Natural triage surface for long-text infospaces — spot chapters that
   under-extracted or evaluated poorly.

Also: `docs/advanced-usage.md` gets a new "Systematic processing of
long texts" section with the recommended flag combo and the tradeoff
note on cost.

11 new unit tests cover the chapters command (text/json/no-sources),
the process flag wiring (help + provider requirement), and the
commit-body bucket logic. Full infospace+llm unit suite (315 tests)
green; 3 pre-existing infospace failures unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-04-22 08:24:26 +02:00

Advanced Usage — Wealth of Nations Infospace

Patterns for working with the WoN infospace (988 entities) after the initial pipeline run. Every command in this file has been run against the actual infospace at the time of writing (2026-04-21); output shapes are excerpted verbatim.

All commands assume cwd = examples/infospace-with-history and the markitect-venv Python environment.

1. Incremental evaluation — add entities after the initial run

markitect infospace evaluate writes one file per entity under output/evaluations/<slug>.md. It skips any entity whose evaluation file already exists, so re-running after adding a new entity processes only the new one.

# Add a new entity file
vim output/entities/new-concept.md

# Evaluate only the new entity (explicit)
markitect infospace evaluate --entity new-concept --provider openrouter

# Or re-run the whole pass — existing 988 are skipped, only the new file hits the LLM
markitect infospace evaluate --provider openrouter

How skip detection works. Evaluation filenames use underscores instead of hyphens, and an apostrophe is kept as an underscore, so a possessive shows up as _s_ (the farmers-capital entity is evaluated in farmer_s_capital.md). If a new entity's slug normalises to the same filename as an existing evaluation, the new entity is skipped. To confirm that an entity was picked up, compare the counts:

# Count entities vs evaluations
ls output/entities/*.md | grep -Ev 'book-[0-9]+-(chapter-[0-9]+|introduction)-' | wc -l
ls output/evaluations/*.md | wc -l
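The counts above say whether something is missing but not which entity. A minimal sketch of one way to list entities without an evaluation file follows. It does not reproduce markitect's own normalisation (which is not shown here); instead it compares names with every non-alphanumeric character stripped, which is an assumption that happens to make farmers-capital and farmer_s_capital match. The function names are hypothetical.

```python
import re
from pathlib import Path

# Same exclusion as the grep above: per-chapter listing files are not entities.
CHAPTER_FILE = re.compile(r"book-[0-9]+-(chapter-[0-9]+|introduction)-")


def loose_key(stem):
    """Strip everything but letters and digits so that hyphen, underscore
    and apostrophe variants of a slug compare equal (an assumption, not
    markitect's actual rule)."""
    return re.sub(r"[^a-z0-9]", "", stem.lower())


def missing_evaluations(entities_dir, evaluations_dir):
    """Return entity slugs that have no evaluation file under the loose match."""
    evaluated = {loose_key(p.stem) for p in Path(evaluations_dir).glob("*.md")}
    return sorted(
        p.stem
        for p in Path(entities_dir).glob("*.md")
        if not CHAPTER_FILE.search(p.name) and loose_key(p.stem) not in evaluated
    )
```

Calling `missing_evaluations("output/entities", "output/evaluations")` from the infospace directory returns the slugs still waiting for an evaluation. Because the loose key is more forgiving than a real slug rule, two genuinely different slugs could in principle share a key, so treat an empty result as a hint rather than proof.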

2. Re-evaluating after guideline changes

evaluate has no --force flag; re-evaluation requires deleting the existing file first.

# Re-evaluate a single entity after updating the evaluation rubric
rm output/evaluations/accumulation_of_stock.md
markitect infospace evaluate --entity accumulation-of-stock --provider openrouter

# Re-evaluate a whole chapter
ls output/entities/book-1-chapter-06-entities.md   # see which entities the chapter produced
# Map chapter entities to eval filenames (apostrophe/underscore normalisation) and rm them

After re-evaluating, refresh the aggregate:

markitect infospace eval-summary --update-metrics

This merges per_entity_mean into output/metrics/metrics.yaml so the next markitect infospace viability check reflects the new scores.

3. Interpreting per-entity score distributions

eval-summary shows the overall mean, the mean for each of the five evaluation dimensions, and the range of per-entity scores:

$ markitect infospace eval-summary
Evaluation summary — 985 entities evaluated

  Dimension                        Mean
  --------------------------------------
  overall                         3.956
  definition_precision            3.620
  domain_placement                4.559
  explanatory_value               3.936
  source_grounding                4.358
  vsm_relevance                   3.305

  Range: 1.00 – 4.80

Interpretation:

overall (3.956) is above the 3.5 viability threshold, so the collection passes per_entity_mean.
The lowest dimension (vsm_relevance = 3.305) is the weakest signal. If the collection is meant to be VSM-grounded, this is the dimension most worth improving (via sharper entity definitions or schema changes).
A wide range (1.00 – 4.80) tells you there are outliers at both ends — worth triaging (see pattern 4).
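The reading above can be expressed as a few lines of Python. This is only an illustrative sketch using the means from the output shown; the `summarise` helper is hypothetical, and comparing individual dimensions against the 3.5 threshold is an assumption made for triage (the text only states that threshold for the overall mean).

```python
VIABILITY_THRESHOLD = 3.5  # per_entity_mean threshold quoted in this section

# Means copied from the eval-summary output above.
overall_mean = 3.956
dimension_means = {
    "definition_precision": 3.620,
    "domain_placement": 4.559,
    "explanatory_value": 3.936,
    "source_grounding": 4.358,
    "vsm_relevance": 3.305,
}


def summarise(overall, dimensions, threshold=VIABILITY_THRESHOLD):
    """Report whether the overall mean passes, and which dimension is weakest."""
    weakest = min(dimensions, key=dimensions.get)
    # Assumption: reuse the overall threshold as a per-dimension triage line.
    below = sorted(name for name, mean in dimensions.items() if mean < threshold)
    return {"passes": overall >= threshold, "weakest": weakest, "below_threshold": below}


print(summarise(overall_mean, dimension_means))
```

With these numbers the collection passes, and vsm_relevance is both the weakest dimension and the only one under 3.5.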

4. Triaging low scorers

markitect infospace entities --by-type prints each entity's star score in-line:

$ markitect infospace entities --by-type | head
=== Element (315 entities) ===
  active_and_productive_stock              Accumulation       S1   ★4.6
  advanced_state_of_society                General Theory     S5
  agio_of_bank_money                       Exchange           S2   ★4.8

Entities with no ★ have no evaluation yet. To list the lowest-scoring entities across the whole collection:

# Extract overall_score from every evaluation file and sort ascending
for f in output/evaluations/*.md; do
  score=$(awk '/^overall_score:/ {print $2; exit}' "$f")
  printf "%s\t%s\n" "$score" "$(basename "$f" .md)"
done | sort -n | head -20

The 20 lowest scorers are the natural triage list — inspect their output/entities/<slug>.md and evaluation rationales to decide whether to refine the entity, merge it with a better-formed neighbour, or drop it.
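If you prefer to do the same triage from Python (for example to feed the list into another script), the shell loop can be sketched as below. It assumes, like the awk line, that each evaluation file contains a line starting with `overall_score:` followed by a number. The function names are hypothetical, and unlike the shell version this sketch skips files that have no parseable score instead of sorting them to the top.

```python
from pathlib import Path


def overall_score(path):
    """Return the first `overall_score:` value in the file, or None if it is
    absent or not a number."""
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if line.startswith("overall_score:"):
            parts = line.split()
            try:
                return float(parts[1])
            except (IndexError, ValueError):
                return None
    return None


def lowest_scorers(evaluations_dir, limit=20):
    """Return up to `limit` (score, slug) pairs, lowest score first."""
    scored = []
    for path in Path(evaluations_dir).glob("*.md"):
        score = overall_score(path)
        if score is not None:
            scored.append((score, path.stem))
    return sorted(scored)[:limit]
```

`lowest_scorers("output/evaluations")` returns the same kind of list as the loop above, as tuples rather than tab-separated lines.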

5. Reading and acting on collection-check output

markitect infospace check runs five concerns (C1–C5). Use --concern to focus on one and --json for machine-readable output:

# Redundancy — which pairs of entities are suspiciously similar?
markitect infospace check --concern redundancy --json

{
  "redundancy": {
    "concern": "C1",
    "redundancy_ratio": 0.0061,
    "similar_pairs": [
      {"entity_a": "bank_economic_contribution_metrics",
       "entity_b": "bank_economic_development_metrics",
       "similarity": 1.0, "method": "word_overlap"},
      {"entity_a": "economic_system_objectives",
       "entity_b": "economic_system_purpose",
       "similarity": 0.9394, "method": "word_overlap"}
    ]
  }
}

Acting on this:

Similarity = 1.0 is almost certainly a duplicate — pick one slug and merge or delete the other.
0.85–0.99 usually means two entities genuinely cover the same idea with slight phrasing differences. Merging is the cleanest fix.
< 0.85 usually represents legitimate adjacent concepts — leave as-is unless the definition rubric says otherwise.
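The three bands above can be applied mechanically to the JSON report. The following is a rough sketch, assuming the JSON has exactly the shape excerpted above; the `triage_pairs` helper and the bucket names are hypothetical, and the sample data is the excerpt itself.

```python
import json


def triage_pairs(report):
    """Sort similar pairs into the three bands described in the text."""
    buckets = {"duplicate": [], "merge_candidate": [], "adjacent": []}
    for pair in report["redundancy"]["similar_pairs"]:
        names = (pair["entity_a"], pair["entity_b"])
        similarity = pair["similarity"]
        if similarity >= 1.0:
            buckets["duplicate"].append(names)
        elif similarity >= 0.85:
            # Values between 0.99 and 1.0 also land here (a choice made in this sketch).
            buckets["merge_candidate"].append(names)
        else:
            buckets["adjacent"].append(names)
    return buckets


# Sample input: the excerpt of `check --concern redundancy --json` shown above.
sample = json.loads("""
{
  "redundancy": {
    "concern": "C1",
    "redundancy_ratio": 0.0061,
    "similar_pairs": [
      {"entity_a": "bank_economic_contribution_metrics",
       "entity_b": "bank_economic_development_metrics",
       "similarity": 1.0, "method": "word_overlap"},
      {"entity_a": "economic_system_objectives",
       "entity_b": "economic_system_purpose",
       "similarity": 0.9394, "method": "word_overlap"}
    ]
  }
}
""")

print(triage_pairs(sample))
```

On the excerpt, the two bank_* entities come out as a duplicate and the two economic_system_* entities as a merge candidate. The cut-offs are guidance rather than hard rules, so the buckets are a reading order for manual review, not an automatic merge list.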

For coverage and coherence, the pattern is the same: the --json output surfaces the specific entities / missing links / disconnected components you need to look at, rather than a bare ratio.

6. Systematic processing of long texts

For long source material (books, multi-chapter specifications, corpora), the pipeline can produce a clean chapter-by-chapter git history on its own if you let it. The pattern:

# Process all sources in canonical order, eval and classify per chapter,
# snapshot metrics after each chapter.
markitect infospace process --all \
    --provider openrouter \
    --eval-after-source \
    --classify-after-source \
    --check-after-each

What you get:

One commit per source file, not per batch run. The commit message body lists counts by bucket (entities: +23, evaluations: +23, classifications: +23) derived from the actual staged diff, so git log reads like the story of the infospace growing.
Chapter-atomic commits. --eval-after-source and --classify-after-source evaluate and classify only the new entities from the just-processed source before the commit lands, so each commit is a self-contained chapter snapshot.
Metrics-per-chapter trail. --check-after-each appends a snapshot to output/metrics/history.yaml after every chapter, so markitect infospace history later shows the metric trajectory rather than just start/end.
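To make the "counts by bucket" idea concrete, here is a small sketch of how added file paths could be grouped by output subdirectory and turned into a summary line. It is not the code in `SourcePipeline._git_commit`; the helper names are hypothetical, the bucket names are the output subdirectories used in this infospace, and the list of added paths is assumed to come from something like `git diff --cached --name-only --diff-filter=A`.

```python
from collections import Counter
from pathlib import PurePosixPath

# Output subdirectories treated as buckets (assumed to sit directly under output/).
BUCKETS = ("entities", "evaluations", "classifications",
           "mappings", "analyses", "metrics", "logs")


def bucket_counts(added_paths, root="output"):
    """Count added files per bucket; paths outside root/<bucket>/ are ignored."""
    counts = Counter()
    for raw in added_paths:
        parts = PurePosixPath(raw).parts
        if len(parts) >= 3 and parts[0] == root and parts[1] in BUCKETS:
            counts[parts[1]] += 1
    return counts


def commit_body(counts):
    """Format non-empty buckets as 'entities: +23, evaluations: +23'."""
    return ", ".join(f"{name}: +{counts[name]}" for name in BUCKETS if counts[name])


added = [
    "output/entities/division-of-labour.md",   # illustrative paths
    "output/entities/market-price.md",
    "output/evaluations/market_price.md",
    "README.md",
]
print(commit_body(bucket_counts(added)))
```

For the four illustrative paths this prints `entities: +2, evaluations: +1`; the README change is ignored because it is not under an output bucket.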

Cost tradeoff. --eval-after-source pays LLM latency per chapter rather than amortising it across one bulk batch. It's worth it when you care about the git history or want early quality signal, not when you're bulk-backfilling a known-good corpus.

Triage during the run. While processing, use markitect infospace chapters in another shell to see per-source entity/eval/classify counts and mean scores — handy for spotting chapters that under-extracted or evaluated poorly.

$ markitect infospace chapters
source               entities  evaluated  classified  mean_score
-------------------  --------  ---------  ----------  ----------
book-1-chapter-01    96        96         79          4.22
book-1-chapter-02    16        16         10          4.06
…
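The text table is easy to post-process when you want to flag chapters automatically. The sketch below parses the text layout shown above and lists sources whose evaluated or classified count trails the entity count. The function names are hypothetical, and only the text output is handled because the JSON shape is not shown here.

```python
def parse_chapters(text):
    """Parse `markitect infospace chapters` text output into a list of dicts."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # blank lines, elisions, rows with a missing column
        try:
            rows.append({
                "source": fields[0],
                "entities": int(fields[1]),
                "evaluated": int(fields[2]),
                "classified": int(fields[3]),
                "mean_score": float(fields[4]),
            })
        except ValueError:
            continue  # header and separator rows are not numeric
    return rows


def incomplete(rows):
    """Sources where not every entity has been evaluated and classified."""
    return [
        row["source"]
        for row in rows
        if row["evaluated"] < row["entities"] or row["classified"] < row["entities"]
    ]


# Sample input: the two rows excerpted above.
sample = """\
source               entities  evaluated  classified  mean_score
-------------------  --------  ---------  ----------  ----------
book-1-chapter-01    96        96         79          4.22
book-1-chapter-02    16        16         10          4.06
"""

print(incomplete(parse_chapters(sample)))
```

Both sample chapters are reported, since each has fewer classified entities than extracted ones.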
