infospace-bench/docs/evaluation-history-and-metrics.md

# Evaluation History And Metrics

`infospace-bench` keeps evaluation history as committed, inspectable files under
each infospace root. This replaces the legacy `markitect-project` history
workflow while retaining the useful behaviors: Markdown evaluation files,
append-only snapshot history, metric merging, and viability checks.

## Files

- `output/evaluations/*.md`: per-artifact evaluation files with YAML
  frontmatter and a human-readable Markdown body.
- `output/metrics/metrics.yaml`: latest merged metrics. Collection metrics,
  evaluation-derived metrics, and structured non-numeric values are preserved.
- `output/metrics/history.yaml`: append-only list of evaluation snapshots.
- `output/metrics/snapshots/<snapshot-id>.yaml`: named snapshot copies for
  reproducible diffs.
- `output/metrics/viability.yaml`: structured viability report generated from
  `infospace.yaml` thresholds and the current metrics file.
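
The merge and append-only behaviors described above can be sketched in a few lines. This is a minimal illustration on in-memory dictionaries, not the actual `infospace-bench` implementation: the helper names `merge_metrics` and `append_snapshot` and the `snapshot_id` / `metrics` entry fields are assumptions made for the example, and reading or writing the YAML files is left out.

```python
from copy import deepcopy


def merge_metrics(current, incoming):
    """Return a new dict with `incoming` merged over `current`.

    Nested mappings are merged key by key. Any other value (number, string,
    list) from `incoming` replaces the old one, so structured non-numeric
    values are carried through rather than dropped.
    """
    merged = deepcopy(current)
    for key, value in incoming.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_metrics(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def append_snapshot(history, snapshot_id, metrics):
    """Return a new history list with one snapshot added at the end.

    Existing entries are never modified or reordered (append-only).
    The entry layout here is hypothetical.
    """
    entry = {"snapshot_id": snapshot_id, "metrics": deepcopy(metrics)}
    return [*history, entry]


# Hypothetical collection metrics and evaluation-derived metrics.
collection = {"artifact_count": 12, "sources": {"kind": "markdown"}}
evaluation = {"coverage_ratio": 0.75, "sources": {"reviewed": True}}

latest = merge_metrics(collection, evaluation)
history = append_snapshot([], "snap-a", latest)
```

Returning new objects instead of mutating the inputs keeps earlier snapshots stable, which is what makes later diffs between them reproducible.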

## Replacement Mapping

The old infospace history code used entity-oriented names such as
`entity_count`, `entity_evaluations`, and `entity_slug`. The successor model
uses artifact-oriented names:

- `artifact_count` replaces `entity_count`
- `artifact_evaluations` replaces `entity_evaluations`
- `artifact_id` replaces `entity_slug`

Readers accept the old snapshot aliases where practical so legacy fixtures can
be inspected, but new files should use the artifact-oriented vocabulary.
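
One way a reader could accept the legacy aliases is to rename keys while loading. The sketch below assumes snapshots are already parsed into plain dictionaries and lists; `normalize_keys` is a hypothetical helper name and `score` is an illustrative field, while the alias table mirrors the mapping listed above.

```python
# Legacy entity-oriented key -> artifact-oriented key.
LEGACY_ALIASES = {
    "entity_count": "artifact_count",
    "entity_evaluations": "artifact_evaluations",
    "entity_slug": "artifact_id",
}


def normalize_keys(value):
    """Recursively rename legacy keys to their artifact-oriented names.

    If a mapping contains both the legacy key and its replacement,
    the replacement wins and the legacy key is skipped.
    """
    if isinstance(value, dict):
        renamed = {}
        for key, item in value.items():
            new_key = LEGACY_ALIASES.get(key, key)
            if new_key != key and new_key in value:
                continue  # the artifact-oriented key is already present
            renamed[new_key] = normalize_keys(item)
        return renamed
    if isinstance(value, list):
        return [normalize_keys(item) for item in value]
    return value


legacy_snapshot = {
    "entity_count": 2,
    "entity_evaluations": [{"entity_slug": "intro", "score": 0.5}],
}
normalized = normalize_keys(legacy_snapshot)
# normalized == {
#     "artifact_count": 2,
#     "artifact_evaluations": [{"artifact_id": "intro", "score": 0.5}],
# }
```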

## CLI

```bash
python3 -m infospace_bench metrics infospaces/bootstrap-pilot
python3 -m infospace_bench history infospaces/bootstrap-pilot
python3 -m infospace_bench history infospaces/bootstrap-pilot --metric coverage_ratio
python3 -m infospace_bench history-diff infospaces/bootstrap-pilot snap-a snap-b
```

A snapshot reference may be an exact snapshot ID or an ISO-style date such as
`2026-05-14`. A date reference resolves to the snapshot in the history that is
nearest to that date.
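
The resolution rule for snapshot references can be illustrated as follows. This is a sketch under stated assumptions rather than the tool's actual logic: it assumes each history entry carries a `snapshot_id` and an ISO 8601 `timestamp` field, that "nearest" means the smallest absolute difference in calendar days, and that ties go to the earlier entry in the list. The function name `resolve_snapshot` is hypothetical.

```python
from datetime import date, datetime


def resolve_snapshot(history, reference):
    """Resolve `reference` to a snapshot ID.

    An exact ID match is tried first. Otherwise the reference is parsed as
    an ISO date and the snapshot with the closest date is returned.
    """
    if not history:
        raise KeyError("history is empty")

    for snapshot in history:
        if snapshot["snapshot_id"] == reference:
            return snapshot["snapshot_id"]

    try:
        target = date.fromisoformat(reference)
    except ValueError:
        raise KeyError(f"unknown snapshot reference: {reference!r}") from None

    def distance_in_days(snapshot):
        taken = datetime.fromisoformat(snapshot["timestamp"]).date()
        return abs((taken - target).days)

    # min() keeps the first entry on ties, so the earlier snapshot wins.
    return min(history, key=distance_in_days)["snapshot_id"]


# Hypothetical history entries.
history = [
    {"snapshot_id": "snap-a", "timestamp": "2026-05-10T09:00:00"},
    {"snapshot_id": "snap-b", "timestamp": "2026-05-15T18:30:00"},
]

by_id = resolve_snapshot(history, "snap-a")
by_date = resolve_snapshot(history, "2026-05-14")
```

Here `by_id` is `"snap-a"` through the exact match, and `by_date` is `"snap-b"` because 2026-05-15 is one day from the requested date while 2026-05-10 is four days away.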