Three coordinated changes that let the pipeline produce a clean
chapter-by-chapter git history on long texts without archaeology after
the fact.
1. Richer commit messages. `SourcePipeline._git_commit` now diffs the
staged changes, buckets added files by output subdirectory (entities,
evaluations, classifications, mappings, analyses, metrics, logs), and
includes the counts in the commit body, so `git log` reads "entities:
+23, evaluations: +23" per chapter instead of the same generic blurb
on every commit. No behaviour change when no output changed; the
commit falls back to the original message if the diff query fails.
2. `--eval-after-source` / `--classify-after-source` on `infospace process`.
After a source's stages succeed, the pipeline identifies which entity
files are *new* (set diff of entity slugs before vs after), loads
their EntityMeta, and runs per-entity evaluation and/or
classification scoped to just those slugs before the per-source git
commit lands. Result: each chapter's commit is self-contained, with
extraction, evaluation, and classification in one atomic unit. Gated
behind explicit flags because the cost is real (LLM latency is paid
per chapter rather than amortised across one bulk batch).
3. `markitect infospace chapters` subcommand. Lists source files in
canonical order with entity count, evaluated count, classified
count, and mean per-entity score per source. Text or JSON output.
A natural triage surface for long-text infospaces: spot chapters that
under-extracted or evaluated poorly.
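A minimal sketch of the logic behind changes 1 and 2, not the actual
`SourcePipeline` code. It assumes the staged diff arrives as the text of
`git diff --cached --name-status` and that entity slugs are plain strings;
the helper names `bucket_added_files`, `format_commit_body`, and
`new_entity_slugs` are hypothetical.

```python
from collections import Counter
from pathlib import PurePosixPath

# Output subdirectories named in this commit; other paths are ignored.
BUCKETS = ("entities", "evaluations", "classifications", "mappings",
           "analyses", "metrics", "logs")


def bucket_added_files(name_status: str) -> Counter:
    """Count added files per bucket from `git diff --cached --name-status` text."""
    counts = Counter()
    for line in name_status.splitlines():
        status, _, path = line.partition("\t")
        if status != "A":  # only newly added files are counted
            continue
        # Assumption: a file belongs to the first bucket found in its path.
        for part in PurePosixPath(path).parts:
            if part in BUCKETS:
                counts[part] += 1
                break
    return counts


def format_commit_body(counts: Counter) -> str:
    """Render non-zero counts in BUCKETS order, e.g. 'entities: +23'."""
    return ", ".join(f"{b}: +{counts[b]}" for b in BUCKETS if counts[b])


def new_entity_slugs(before: set[str], after: set[str]) -> list[str]:
    """Slugs present after processing a source but not before, sorted."""
    return sorted(after - before)
```

An empty string from `format_commit_body` would correspond to the
no-output-changed case, where the original message is kept as-is.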
Also: `docs/advanced-usage.md` gets a new "Systematic processing of
long texts" section with the recommended flag combo and the tradeoff
note on cost.
11 new unit tests cover the chapters command (text/json/no-sources),
the process flag wiring (help + provider requirement), and the
commit-body bucket logic. Full infospace+llm unit suite (315 tests)
green; 3 pre-existing infospace failures unchanged.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Closes out three docs tasks from roadmap/infospace-s3-closeout/PLAN.md:
- examples/infospace-with-history/docs/advanced-usage.md (C.4) — 5 worked
patterns covering incremental eval, the re-eval workflow (no `--force`
flag exists; documents the rm-then-re-run pattern instead), interpreting
the eval-summary distribution, triaging low scorers via an awk pipeline
over `overall_score` (since `entities --sort-by score` does not exist),
and acting on `check --json` output.
- docs/composition-guide.md (C.5) — walks through how supply-chain-vsm
binds WoN as a discipline, then a step-by-step for creating a new
infospace that binds an existing one. Includes live output from
`markitect infospace disciplines`.
- examples/infospace-with-history/docs/performance-notes.md (C.6) — cites
the 6h 28m wall time of the 985-entity S3.3 batch, the ~2.5 ent/min
rate, the ~2000–3000 tokens/entity estimate, word_overlap vs embedding
backends for redundancy checks, and a provider-by-scale recommendation
table.
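The quoted rate follows directly from the wall time and entity count. A
quick check in Python, using only the figures above; the batch token
totals are derived from the per-entity estimate and are a ballpark, not
a measurement:

```python
from datetime import timedelta

wall_time = timedelta(hours=6, minutes=28)
entities = 985

minutes = wall_time.total_seconds() / 60   # 388.0 minutes
rate = entities / minutes                  # entities per minute

# Ballpark batch totals from the ~2000-3000 tokens/entity estimate.
tokens_low, tokens_high = 2000 * entities, 3000 * entities

print(f"{rate:.2f} ent/min")                         # 2.54 ent/min
print(f"{tokens_low:,} to {tokens_high:,} tokens")   # 1,970,000 to 2,955,000 tokens
```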
All commands in these docs were run against the live infospace at
commit time.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>