Three coordinated changes that let the pipeline produce a clean
chapter-by-chapter git history on long texts without archaeology after
the fact.
1. Richer commit messages. `SourcePipeline._git_commit` now diffs the
staged changes, buckets added files by output subdirectory (entities,
evaluations, classifications, mappings, analyses, metrics, logs), and
includes counts in the commit body. So `git log` reads "entities:
+23, evaluations: +23" per chapter instead of the same generic blurb
on every commit. Zero behaviour change when no output changed; falls
back to the original message if the diff query fails.
2. `--eval-after-source` / `--classify-after-source` on `infospace process`.
After a source's stages succeed, the pipeline identifies which entity
files are *new* (set diff of entity slugs before vs after), loads
their EntityMeta, and runs per-entity evaluation and/or
classification scoped to just those slugs before the per-source git
commit lands. Result: each chapter's commit is self-contained —
extraction + evaluation + classification in one atomic unit. Gated
behind explicit flags because the cost is real (LLM latency per
chapter rather than amortised across one bulk batch).
3. `markitect infospace chapters` subcommand. Lists source files in
canonical order with entity count, evaluated count, classified
count, and mean per-entity score per source. Text or JSON output.
Natural triage surface for long-text infospaces — spot chapters that
under-extracted or evaluated poorly.
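A minimal sketch of two mechanisms described above: bucketing added files into commit-body counts, and the before/after set diff of entity slugs. The function names, the `output/` root and the path layout are assumptions for illustration, not the actual `SourcePipeline` code. The list of added paths is assumed to come from something like `git diff --cached --name-only --diff-filter=A`.

```python
from collections import Counter
from pathlib import PurePosixPath

# Bucket names are taken from the commit message; the "output" root is assumed.
BUCKETS = ("entities", "evaluations", "classifications", "mappings",
           "analyses", "metrics", "logs")


def bucket_added_files(added_paths, output_root="output"):
    """Count added files per known output subdirectory (hypothetical helper)."""
    counts = Counter()
    for raw in added_paths:
        parts = PurePosixPath(raw).parts
        # Expect <output_root>/<bucket>/<file...>; ignore anything else.
        if len(parts) >= 3 and parts[0] == output_root and parts[1] in BUCKETS:
            counts[parts[1]] += 1
    return counts


def format_commit_body(counts):
    """Render e.g. 'entities: +23, evaluations: +23' in a fixed bucket order."""
    return ", ".join(f"{name}: +{counts[name]}" for name in BUCKETS if counts[name])


def new_entity_slugs(before, after):
    """Slugs present after processing a source that were not there before."""
    return sorted(set(after) - set(before))
```

An empty result from `format_commit_body` would be the signal to keep the original generic message, which matches the "zero behaviour change when no output changed" note.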
Also: `docs/advanced-usage.md` gets a new "Systematic processing of
long texts" section with the recommended flag combo and the tradeoff
note on cost.
11 new unit tests cover the chapters command (text/json/no-sources),
the process flag wiring (help + provider requirement), and the
commit-body bucket logic. Full infospace+llm unit suite (315 tests)
green; 3 pre-existing infospace failures unchanged.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Free-tier APIs intermittently return invalid JSON or empty responses.
Any exception raised in `_call_llm` is now retried up to 3 times with a 5s
back-off, instead of failing immediately on non-rate-limit errors.
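The retry behaviour can be sketched roughly as below. `call_with_retry` is a hypothetical wrapper, not the real `_call_llm`. It reads "up to 3 times" as three retries after the first attempt, which is an assumption, and the `sleep` parameter exists only so the wait can be stubbed in tests.

```python
import time


def call_with_retry(fn, max_retries=3, backoff_seconds=5.0, sleep=time.sleep):
    """Call fn(); on any exception wait a fixed interval and try again.

    Hypothetical helper: max_retries counts retries after the first attempt.
    """
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as exc:  # deliberately broad: invalid JSON, empty body, etc.
            last_error = exc
            if attempt < max_retries:
                sleep(backoff_seconds)
    # All attempts failed: surface the last error to the caller.
    raise last_error
```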
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- PipelineStage now supports max_tokens to override the 4096 default
- SourcePipeline records provider/model on each entity file as an HTML comment
- output/processing-log.yaml tracks tokens, cost, duration, retries, errors
- _call_llm returns (content, metadata) for downstream traceability
- _http.py wraps JSON parse errors with body preview for debugging
- infospace.yaml stages: extract/map=6000 tokens, synthesize=3000 tokens
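A rough sketch of the traceability pieces listed above, assuming `_call_llm` hands back a metadata dict alongside the content. The comment format, the metadata keys and both helper names are illustrative guesses, not the actual on-disk format of the entity files or of `output/processing-log.yaml`.

```python
def stamp_provenance(body, provider, model):
    """Prefix entity text with an HTML comment naming provider and model.

    The exact comment layout is an assumption.
    """
    header = f"<!-- provider: {provider} model: {model} -->"
    return header + "\n" + body


def log_entry(source, stage, metadata, duration_s, retries=0, errors=()):
    """Build one processing-log record from call metadata (hypothetical keys)."""
    return {
        "source": source,
        "stage": stage,
        "provider": metadata.get("provider"),
        "model": metadata.get("model"),
        "tokens": metadata.get("tokens"),
        "cost": metadata.get("cost"),
        "duration_s": round(duration_s, 3),
        "retries": retries,
        "errors": list(errors),
    }
```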
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- SourcePipeline: retry split_entities stage once when 0 entity delimiters
are found (free-tier models intermittently return short non-formatted
responses); save raw LLM response to <stage>-raw.md alongside prompts
- Return None (pause pipeline) rather than writing an empty view file when
no entities are found after max retries
- _http.py: wrap json.JSONDecodeError in LLMAPIError with body preview
- extract-entities.md: add explicit H2-heading format example to Output
Format section to prevent models from using inline "Section:" format
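The JSON error wrapping might look roughly like this. The `LLMAPIError` class here is a local stand-in, since the real constructor in `_http.py` may differ, and `parse_json_body` and the 200-character preview length are assumptions.

```python
import json


class LLMAPIError(Exception):
    """Stand-in for the project's error type; the real signature may differ."""


def parse_json_body(body, preview_chars=200):
    """Parse a response body, attaching a short preview when it is not JSON."""
    try:
        return json.loads(body)
    except json.JSONDecodeError as exc:
        # Keep only the start of the body so logs stay readable.
        preview = body[:preview_chars]
        raise LLMAPIError(
            f"invalid JSON from provider ({exc.msg} at char {exc.pos}); "
            f"body preview: {preview!r}"
        ) from exc
```

Chaining with `from exc` keeps the original decode error available as `__cause__` for debugging.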
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Extend PipelineStage with name, output_dir, output_macro,
split_entities, and macros fields for declarative pipeline config
- Add SourcePipeline class (pipeline.py) using simple @{macro}
substitution — no SQLite dependency, skip-if-exists per stage,
LLM retry on rate limits, git commit per source
- Add `markitect infospace process [GLOB_PATTERN]` CLI command with
--all, --provider, --model, --check-after-each, --no-commit flags
- Update infospace.yaml with output_dir, output_macro, split_entities,
and macros for each pipeline stage in the WoN example
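For illustration, `@{macro}` substitution can be done with a single regex pass, as sketched below. `expand_macros`, the allowed macro-name characters and the choice to leave unknown macros untouched are assumptions, not necessarily how `pipeline.py` behaves.

```python
import re

# Matches @{name}; the permitted name characters are an assumption.
_MACRO = re.compile(r"@\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_macros(template, macros):
    """Replace each @{name} in template with macros[name] (hypothetical helper)."""
    def _replace(match):
        name = match.group(1)
        # Leave unknown macros as-is so a missing value is visible in the prompt.
        return str(macros.get(name, match.group(0)))
    return _MACRO.sub(_replace, template)
```

Using a replacement function rather than a replacement string avoids backslash-escape surprises when macro values contain source text.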
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>