---
id: IB-WP-0008
type: workplan
title: "Evaluation History And Metrics Parity"
domain: markitect
repo: infospace-bench
status: completed
owner: markitect
topic_slug: markitect
created: "2026-05-14"
updated: "2026-05-14"
state_hub_workstream_slug: "ib-wp-0008-evaluation-history-metrics-parity"
state_hub_workstream_id: "f00ba036-dc97-4370-a4a5-9ac2bce7ce6f"
---

# IB-WP-0008 — Evaluation History And Metrics Parity

## Goal

Bring the current evaluation dataclasses up to practical parity with the old infospace evaluation history and metrics behavior.

## Tasks

### T01 — Evaluation file I/O

```task
id: IB-WP-0008-T01
status: done
priority: high
state_hub_task_id: "95b48ad3-c4d1-442c-9bc0-7591d948d23e"
```

- Write and read per-entity/per-artifact evaluation files
- Preserve human-readable Markdown body plus structured metadata
- Add round-trip tests

### T02 — Snapshot history

```task
id: IB-WP-0008-T02
status: done
priority: high
state_hub_task_id: "b4800ba8-5b86-44bb-bf47-e893bae36b22"
```

- Write, append, and read evaluation snapshot history
- Support diffing named snapshots from disk
- Add CLI support for history and history-diff behavior

### T03 — Metrics merge and viability reports

```task
id: IB-WP-0008-T03
status: done
priority: high
state_hub_task_id: "7abcbd63-0147-4ae8-85f0-4af51882f476"
```

- Persist latest metrics under `output/metrics`
- Merge collection metrics with evaluation-derived metrics
- Emit structured viability reports

### T04 — Legacy metric compatibility notes

```task
id: IB-WP-0008-T04
status: done
priority: medium
state_hub_task_id: "675d1d45-39d9-4ddd-9ab7-5d7de8a0f601"
```

- Compare old metric names and meanings to new baseline
- Document changed semantics and accepted differences
- Add fixtures that preserve critical old behavior

## Acceptance

- Evaluation and metrics history are inspectable, diffable, and reproducible
- Viability reports can be generated from committed files
- Old evaluation-history workflows have a clear replacement path
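The T01 round trip and the T02 snapshot history can be illustrated with a minimal sketch. It assumes a JSON metadata block between `---` fence lines followed by a free-form Markdown body, and an append-only JSON Lines history file. `Evaluation`, `write_evaluation`, `read_evaluation`, `append_snapshot`, `read_history`, and `diff_snapshots` are hypothetical names, not the infospace-bench API. The real files may use YAML front matter and a different layout.

```python
from __future__ import annotations

import json
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any

# Hypothetical delimiter for the metadata block (assumption: JSON, not YAML, to stay stdlib-only).
FENCE = "---"


@dataclass
class Evaluation:
    """Hypothetical per-entity evaluation: structured metadata plus a human-readable Markdown body."""
    entity_id: str
    metrics: dict[str, float] = field(default_factory=dict)
    body: str = ""


def write_evaluation(path: Path, ev: Evaluation) -> None:
    """Write metadata between two fence lines, then the Markdown body verbatim."""
    meta = {"entity_id": ev.entity_id, "metrics": ev.metrics}
    text = f"{FENCE}\n{json.dumps(meta, indent=2, sort_keys=True)}\n{FENCE}\n{ev.body}"
    path.write_text(text, encoding="utf-8")


def read_evaluation(path: Path) -> Evaluation:
    """Split at the first two fence lines only, so a body containing '---' survives intact."""
    text = path.read_text(encoding="utf-8")
    _, meta_text, body = text.split(f"{FENCE}\n", 2)
    meta = json.loads(meta_text)
    return Evaluation(entity_id=meta["entity_id"], metrics=meta["metrics"], body=body)


def append_snapshot(history: Path, name: str, metrics: dict[str, float]) -> None:
    """Append one named snapshot per line; earlier snapshots are never rewritten."""
    with history.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps({"name": name, "metrics": metrics}, sort_keys=True) + "\n")


def read_history(history: Path) -> list[dict[str, Any]]:
    """Return all snapshots in write order (empty if no history file exists yet)."""
    if not history.exists():
        return []
    lines = history.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]


def diff_snapshots(history: Path, old: str, new: str) -> dict[str, tuple[float | None, float | None]]:
    """Report metrics whose values differ between two named snapshots.

    A metric missing from one side shows up as None; if a name was written
    twice, the later snapshot wins. Unknown names raise KeyError.
    """
    snaps = {s["name"]: s["metrics"] for s in read_history(history)}
    a, b = snaps[old], snaps[new]
    return {k: (a.get(k), b.get(k)) for k in sorted(a.keys() | b.keys()) if a.get(k) != b.get(k)}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        ev = Evaluation("entity-42", {"coverage": 0.8}, "# Notes\n\nLooks good.\n")
        write_evaluation(root / "entity-42.md", ev)
        assert read_evaluation(root / "entity-42.md") == ev  # round trip is lossless

        hist = root / "history.jsonl"
        append_snapshot(hist, "baseline", {"coverage": 0.8, "precision": 0.5})
        append_snapshot(hist, "after-fix", {"coverage": 0.9, "precision": 0.5})
        print(diff_snapshots(hist, "baseline", "after-fix"))  # {'coverage': (0.8, 0.9)}
```

An append-only history keeps every snapshot reproducible from committed files, and diffing by name rather than by position keeps a history-diff command stable as new snapshots are appended.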