tegwick e3e5b8ecc1
feat(infospace): systematic long-text processing — rich commit bodies, per-source eval/classify, chapters view
Three coordinated changes that let the pipeline produce a clean
chapter-by-chapter git history on long texts without archaeology after
the fact.

1. Richer commit messages. `SourcePipeline._git_commit` now diffs the
   staged changes, buckets added files by output subdirectory (entities,
   evaluations, classifications, mappings, analyses, metrics, logs), and
   includes counts in the commit body. So `git log` reads "entities:
   +23, evaluations: +23" per chapter instead of the same generic blurb
   on every commit. Zero behaviour change when no output changed; falls
   back to the original message if the diff query fails.

2. `--eval-after-source` / `--classify-after-source` on `infospace process`.
   After a source's stages succeed, the pipeline identifies which entity
   files are *new* (set diff of entity slugs before vs after), loads
   their EntityMeta, and runs per-entity evaluation and/or
   classification scoped to just those slugs before the per-source git
   commit lands. Result: each chapter's commit is self-contained —
   extraction + evaluation + classification in one atomic unit. Gated
   behind explicit flags because the cost is real (LLM latency per
   chapter rather than amortised across one bulk batch).

3. `markitect infospace chapters` subcommand. Lists source files in
   canonical order with entity count, evaluated count, classified
   count, and mean per-entity score per source. Text or JSON output.
   Natural triage surface for long-text infospaces — spot chapters that
   under-extracted or evaluated poorly.
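
As an illustration of the per-source summary described in item 3, here is a minimal sketch. It is not markitect's implementation: `EntityRecord`, `summarize_chapters`, and the field names are hypothetical, and the real command reads its data from the infospace rather than from in-memory records.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class EntityRecord:
    """Hypothetical stand-in for an entity; not markitect's EntityMeta."""
    source: str
    score: Optional[float] = None  # None means "not evaluated yet"
    classified: bool = False


def summarize_chapters(sources: list[str], entities: list[EntityRecord]) -> list[dict]:
    """Build one summary row per source, in the order the caller supplies."""
    rows = []
    for source in sources:
        own = [e for e in entities if e.source == source]
        scores = [e.score for e in own if e.score is not None]
        rows.append({
            "source": source,
            "entities": len(own),
            "evaluated": len(scores),
            "classified": sum(e.classified for e in own),
            # No mean is reported for sources without evaluated entities.
            "mean_score": mean(scores) if scores else None,
        })
    return rows
```

A source with zero entities, or a low mean score, stands out immediately in such a table, which is the triage use the item describes.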

Also: `docs/advanced-usage.md` gets a new "Systematic processing of
long texts" section with the recommended flag combo and the tradeoff
note on cost.
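
The new-entity detection that the two per-source flags rely on (a set difference of entity slugs before and after a source is processed) can be sketched as follows. This is an assumption-laden illustration: it supposes that entities are stored as `*.md` files whose stem is the slug, and the helper names `entity_slugs` and `new_slugs` are hypothetical.

```python
from pathlib import Path


def entity_slugs(entities_dir: Path) -> set[str]:
    """Return the slugs of all entity files in a directory.

    Assumption: one Markdown file per entity, file stem == slug.
    """
    if not entities_dir.is_dir():
        return set()
    return {p.stem for p in entities_dir.glob("*.md")}


def new_slugs(before: set[str], after: set[str]) -> list[str]:
    """Slugs present after processing a source but not before it."""
    return sorted(after - before)
```

Taking the snapshot once before the source's stages run and once after them scopes evaluation and classification to that chapter's entities only, which is what keeps each commit self-contained.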

11 new unit tests cover the chapters command (text/json/no-sources),
the process flag wiring (help + provider requirement), and the
commit-body bucket logic. Full infospace+llm unit suite (315 tests)
green; 3 pre-existing infospace failures unchanged.
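
For orientation, the commit-body bucket logic from item 1 might look roughly like the sketch below. It is not the code in `SourcePipeline._git_commit`: the function names are hypothetical, and it assumes the input is the text produced by `git diff --cached --name-status` and that the bucket names appear as directory components in the file paths.

```python
from collections import Counter
from pathlib import PurePosixPath

# Bucket names as listed in the commit message.
BUCKETS = ("entities", "evaluations", "classifications",
           "mappings", "analyses", "metrics", "logs")


def bucket_added_files(name_status: str) -> Counter:
    """Count added files per bucket from `git diff --cached --name-status` text."""
    counts = Counter()
    for line in name_status.splitlines():
        parts = line.split("\t")
        if len(parts) != 2 or parts[0] != "A":
            continue  # only plain additions; skip modified, deleted, renamed
        components = PurePosixPath(parts[1]).parts
        # First path component that names a known bucket, if any.
        bucket = next((c for c in components if c in BUCKETS), None)
        if bucket is not None:
            counts[bucket] += 1
    return counts


def format_commit_body(counts: Counter) -> str:
    """Render non-zero buckets in a fixed order, e.g. 'entities: +2'."""
    return ", ".join(f"{b}: +{counts[b]}" for b in BUCKETS if counts[b])
```

An empty result from `format_commit_body` corresponds to the "no output changed" case, in which the caller would keep the original message.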

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-22 08:24:26 +02:00

This example provides a tutorial and reference experiment for how to set up a viable infospace with history using markitect.

The task is to capture the knowledge in Adam Smith's The Wealth of Nations, which is available digitally in the public domain as a transcript of the original text, and to transform and extend it into a consistent and complete collection of concepts and entities, viewed from a systems-theoretical perspective based on Stafford Beer's Viable System Model.

The tutorial should explain how to use schemas as scaffolding for structuring the necessary information entities, and how to define a set of prompts and instructions that use the prompt dependency resolution infrastructure to inject the chapters of the book incrementally.

The information space should use the option of keeping changes as git history, and it should define metrics for completeness and consistency.

While the experiment is running, no changes may be made to the markitect infrastructure.

If a need for optimization or error fixes arises, a list of corresponding tasks should be generated. That list will be used to improve the markitect infrastructure, after which the experiment is rerun, so that tooling and infospace improve iteratively over time.

--worsch, 10th Feb. 2026