feat(infospace): systematic long-text processing — rich commit bodies, per-source eval/classify, chapters view
Some checks failed
Test Suite / unit-tests (3.11) (push) Has been cancelled
Test Suite / unit-tests (3.12) (push) Has been cancelled
Test Suite / integration-tests (push) Has been cancelled
Test Suite / e2e-tests (push) Has been cancelled
Test Suite / performance-tests (push) Has been cancelled
Test Suite / code-quality (push) Has been cancelled
Test Suite / security-scan (push) Has been cancelled
Test Suite / test-summary (push) Has been cancelled
Three coordinated changes that let the pipeline produce a clean
chapter-by-chapter git history on long texts without archaeology after
the fact.

1. Richer commit messages. `SourcePipeline._git_commit` now diffs the
   staged changes, buckets added files by output subdirectory (entities,
   evaluations, classifications, mappings, analyses, metrics, logs), and
   includes counts in the commit body. So `git log` reads
   "entities: +23, evaluations: +23" per chapter instead of the same
   generic blurb on every commit. Zero behaviour change when no output
   changed; falls back to the original message if the diff query fails.

2. --eval-after-source / --classify-after-source on `infospace process`.
   After a source's stages succeed, the pipeline identifies which entity
   files are *new* (set diff of entity slugs before vs after), loads
   their EntityMeta, and runs per-entity evaluation and/or classification
   scoped to just those slugs before the per-source git commit lands.
   Result: each chapter's commit is self-contained, with extraction +
   evaluation + classification in one atomic unit. Gated behind explicit
   flags because the cost is real (LLM latency per chapter rather than
   amortised across one bulk batch).

3. `markitect infospace chapters` subcommand. Lists source files in
   canonical order with entity count, evaluated count, classified count,
   and mean per-entity score per source. Text or JSON output. Natural
   triage surface for long-text infospaces: spot chapters that
   under-extracted or evaluated poorly.

Also: `docs/advanced-usage.md` gets a new "Systematic processing of long
texts" section with the recommended flag combo and the tradeoff note on
cost.

11 new unit tests cover the chapters command (text/json/no-sources), the
process flag wiring (help + provider requirement), and the commit-body
bucket logic.

Full infospace+llm unit suite (315 tests) green; 3 pre-existing infospace
failures unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -171,6 +171,57 @@ you need to look at, rather than a bare ratio.
---

## 5. Systematic processing of long texts

For long source material (books, multi-chapter specifications, corpora), the
pipeline can produce a clean chapter-by-chapter git history on its own if
you let it. The pattern:

```bash
# Process all sources in canonical order, eval and classify per chapter,
# snapshot metrics after each chapter.
markitect infospace process --all \
    --provider openrouter \
    --eval-after-source \
    --classify-after-source \
    --check-after-each
```

What you get:

- **One commit per source file**, not per batch run. The commit message body
  lists counts by bucket (`entities: +23`, `evaluations: +23`,
  `classifications: +23`) derived from the actual staged diff, so `git log`
  reads like the story of the infospace growing.
- **Chapter-atomic commits.** `--eval-after-source` and
  `--classify-after-source` evaluate and classify *only the new entities*
  from the just-processed source before the commit lands, so each commit is
  a self-contained chapter snapshot.
- **Metrics-per-chapter trail.** `--check-after-each` appends a snapshot to
  `output/metrics/history.yaml` after every chapter, so `markitect infospace
  history` later shows the metric trajectory rather than just start/end.
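
As a rough illustration of the first two bullets, the sketch below shows one way to find the new entities and to turn a list of added files into per-bucket counts. It is not the actual `SourcePipeline._git_commit` code: the function names `new_entity_slugs` and `commit_body`, the `output/<bucket>/<file>` path layout, and the sample file names are assumptions made for the example. Only the bucket names come from the commit description.

```python
from collections import Counter
from pathlib import PurePosixPath

# Bucket names as listed in the commit description. The assumption that each
# bucket is a directory directly under output/ is made for this sketch only.
BUCKETS = ("entities", "evaluations", "classifications",
           "mappings", "analyses", "metrics", "logs")


def new_entity_slugs(before: set[str], after: set[str]) -> set[str]:
    """Slugs present after processing a source that were absent before it."""
    return after - before


def commit_body(added_paths: list[str]) -> str:
    """Summarise added files as 'bucket: +N' pairs, in BUCKETS order."""
    counts: Counter[str] = Counter()
    for raw in added_paths:
        parts = PurePosixPath(raw).parts
        # Count only paths shaped like output/<bucket>/<file>; skip the rest.
        if len(parts) >= 3 and parts[0] == "output" and parts[1] in BUCKETS:
            counts[parts[1]] += 1
    # Buckets with no added files are left out, so an unchanged run yields "".
    return ", ".join(f"{b}: +{counts[b]}" for b in BUCKETS if counts[b])


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    added = [
        "output/entities/alpha.md",
        "output/entities/beta.md",
        "output/evaluations/alpha.yaml",
        "README.md",
    ]
    print(commit_body(added))  # entities: +2, evaluations: +1
    print(sorted(new_entity_slugs({"alpha"}, {"alpha", "beta"})))  # ['beta']
```

An empty result from `commit_body` is the natural signal to keep the original generic message, which matches the fallback behaviour described for runs where no output changed.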

**Cost tradeoff.** `--eval-after-source` pays LLM latency per chapter rather
than amortising it across one bulk batch. It's worth it when you care about
the git history or want early quality signal, not when you're bulk-backfilling
a known-good corpus.

**Triage during the run.** While processing, use `markitect infospace
chapters` in another shell to see per-source entity/eval/classify counts and
mean scores, which is handy for spotting chapters that under-extracted or
evaluated poorly.

```
$ markitect infospace chapters
source               entities  evaluated  classified  mean_score
-------------------  --------  ---------  ----------  ----------
book-1-chapter-01    96        96         79          4.22
book-1-chapter-02    16        16         10          4.06
…
```
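
For scripted triage, the table above can be parsed with a few lines of Python. The sketch below assumes the text output keeps the five whitespace-separated columns shown in the example. `ChapterRow`, `parse_chapters`, and `classification_gaps` are hypothetical helper names, not part of markitect. The JSON output is probably the sturdier input for real tooling, but its schema is not shown here, so the sketch sticks to the text form.

```python
from dataclasses import dataclass


@dataclass
class ChapterRow:
    source: str
    entities: int
    evaluated: int
    classified: int
    mean_score: float


def parse_chapters(text: str) -> list[ChapterRow]:
    """Parse data rows of the table; header, rule, and other lines are skipped."""
    rows: list[ChapterRow] = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # prompt line, ellipsis, or blank line
        try:
            rows.append(ChapterRow(fields[0], int(fields[1]), int(fields[2]),
                                   int(fields[3]), float(fields[4])))
        except ValueError:
            continue  # header or separator row: columns are not numeric
    return rows


def classification_gaps(rows: list[ChapterRow]) -> dict[str, int]:
    """Map each source with unclassified entities to how many are missing."""
    return {r.source: r.entities - r.classified
            for r in rows if r.classified < r.entities}


if __name__ == "__main__":
    sample = """\
source               entities  evaluated  classified  mean_score
-------------------  --------  ---------  ----------  ----------
book-1-chapter-01    96        96         79          4.22
book-1-chapter-02    16        16         10          4.06
"""
    print(classification_gaps(parse_chapters(sample)))
    # {'book-1-chapter-01': 17, 'book-1-chapter-02': 6}
```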

---

## See also

- `METRICS-METHODOLOGY.md`: how each metric is computed.