fix(example): skip prompt writes when output exists, add quality rubrics

INFRA-TASKS #5: process_chapters.py now skips writing *-prompt.md files when the corresponding output file already exists on disk. DB-only rebuilds no longer dirty the working tree with unchanged prompt content.

INFRA-TASKS #8: Added a '## Quality Metrics' section to the entity and VSM mapping schemas. The entity schema defines five evaluation dimensions (Definition Precision, Source Grounding, Domain Placement, VSM Relevance, Explanatory Value) and the VSM mapping schema defines two (Rationale Rigour, Strength Calibration). Each dimension has a 1–5 rubric used by the evaluate-entity template.

INFRA-TASKS #9: Added evaluate_entities() to process_chapters.py, exposed through the --evaluate flag and the optional --eval-chapter CHAPTER_ID argument.

Also updated INFRA-TASKS.md to reflect current resolution status for tasks 4–19 across S2 and S3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-23 06:04:09 +01:00
parent dfab3d598b
commit fa27572f43
4 changed files with 258 additions and 39 deletions
--- a/examples/infospace-with-history/INFRA-TASKS.md
+++ b/examples/infospace-with-history/INFRA-TASKS.md
@@ -57,7 +57,7 @@ How the example measures against the objectives stated in `README.md`:
 | 9 | No infrastructure changes during experiment | **Violated** | Three infra fixes were required (tasks 1-3 above). Documented as intended. |
 | 10 | Generate task list for infra issues | **Done** | This file. |

-## 4. Infospace has no per-chapter git history — OPEN
+## 4. Infospace has no per-chapter git history — PARTIAL

 **Objective:** README states "The information space should utilize the option
 of keeping changes as git history."
@@ -69,12 +69,15 @@ archive policy. There is no commit where you can `git diff` to see exactly
 what one chapter contributed to the infospace.
 **Impact:** Cannot use `git log`, `git diff`, or `git bisect` to trace how
 the infospace grew chapter by chapter — the core promise of "with history."
-**Suggested fix:** Re-run the 7 processed chapters (and remaining 28) using
+**Progress:** Branch `clean-example-history` was created. Chapters 1-8 have
+clean per-chapter commits. 27 chapters remain. Example completeness (tasks 4
+and 7) is deferred; no further action planned.
+**Suggested fix (original):** Re-run the processed chapters using
 `process_chapters.py` without `--no-commit`, on a clean branch or after
 squashing the current output into a baseline commit. Each chapter gets its
 own commit via `_git_commit_chapter()`.

-## 5. Prompt files are regenerated as a side-effect of DB rebuild — OPEN
+## 5. Prompt files are regenerated as a side-effect of DB rebuild — RESOLVED

 **Issue:** Running `--all --no-commit` to regenerate `infospace.db` also
 overwrites `*-prompt.md` files in the output directories because each
@@ -85,9 +88,10 @@ chapters change on every full run.
 **Impact:** A DB regeneration dirties the working tree with prompt file
 changes, even though no actual outputs changed. Users must `git checkout`
 the prompt files after regeneration.
-**Suggested fix:** Skip writing prompt files when the corresponding output
-file already exists on disk, or add a `--rebuild-db-only` flag that
-populates the database without touching the file system.
+**Fix applied:** Each pipeline stage (`stage_extract_entities`,
+`stage_map_to_vsm`, `stage_synthesize_analysis`, `assess_metrics`) now
+skips writing the `*-prompt.md` file when the corresponding output file
+already exists on disk. DB regeneration no longer dirties the working tree.

 ## 6. Metrics report is stale — OPEN

@@ -99,15 +103,16 @@ the report has not been refreshed.
 after every batch of new chapters. Consider making metrics assessment
 automatic at the end of `--book` or `--all` runs.

-## 7. Remaining 28 chapters not yet processed — OPEN
+## 7. Remaining 28 chapters not yet processed — DEFERRED

 **Issue:** Only Book I chapters 1-7 have been processed. Books II-V
 (28 chapters) remain unprocessed.
 **Impact:** The infospace is incomplete — VSM coverage is limited to S1,
 S2, and partial S4. S3, S3*, S5, and many systemic concepts (algedonic
 signals, recursion, variety) are expected to emerge from later books.
-**Suggested fix:** Process remaining chapters in book-sized batches with
-per-chapter commits, refreshing metrics after each book.
+**Note:** Example completeness is deferred. The 7/35 chapter corpus is
+sufficient to validate the tooling. Resuming requires the `clean-example-history`
+branch and a valid `OPENROUTER_API_KEY`.

 ---

@@ -130,7 +135,7 @@ The improvement splits metrics into two layers:
 Both layers persist results in structured form so they can be diffed,
 tracked over time, and committed alongside the entities they evaluate.

-## 8. Add per-concept quality metrics to entity schema — OPEN
+## 8. Add per-concept quality metrics to entity schema — RESOLVED

 **Issue:** The entity schema (`economic-entity-schema-v1.0.md`) defines
 required sections and validation rules (section presence, word count range)
@@ -158,8 +163,10 @@ Similarly update the VSM mapping schema with:
  Weak) consistent with the rationale given?

 These rubrics become the prompt instructions for task 9.
+**Fix applied:** `## Quality Metrics` section added to
+`schemas/economic-entity-schema-v1.0.md` and `schemas/vsm-mapping-schema-v1.0.md`.

-## 9. Create evaluate-entity prompt template — OPEN
+## 9. Create evaluate-entity prompt template — RESOLVED

 **Depends on:** Task 8 (quality metrics in schema).
 **Issue:** There is no mechanism to evaluate an existing entity after
@@ -193,8 +200,11 @@ Add a pipeline stage: `--evaluate` runs this template against every
 canonical entity and writes results to `output/evaluations/<slug>-eval.md`.
 A `--evaluate --chapter <id>` variant evaluates only entities introduced
 by that chapter.
+**Fix applied:** `templates/evaluate-entity.md` created. `--evaluate` flag
+(scoped to one chapter with `--eval-chapter <id>`) added to `process_chapters.py`.
+Reads `@{quality_rubric}` from the entity schema's Quality Metrics section.

-## 10. Add deterministic schema compliance checker — OPEN
+## 10. Add deterministic schema compliance checker — RESOLVED

 **Issue:** Schema compliance is currently LLM-evaluated ("100%" in the
 metrics report) but the validation rules in the schemas are mechanical:
@@ -222,8 +232,10 @@ Validation: 85 entities, 3 warnings
 ```

 This is fully deterministic — no LLM calls needed.
+**Fix applied:** `markitect/infospace/validator.py` — `validate_entity()`
+and `validate_entities()`. Exposed via `--infospace-check`.

-## 11. Structured metrics output format — OPEN
+## 11. Structured metrics output format — RESOLVED

 **Depends on:** Tasks 9 and 10.
 **Issue:** The metrics report is a markdown narrative. Values cannot be
@@ -261,8 +273,9 @@ evaluation:    # from LLM-eval (task 9)

 The `--metrics` command writes both files. The YAML file is committed
 to git so `git diff` shows exactly how metrics changed between runs.
+**Fix applied:** `output/metrics/metrics.yaml` produced by `--infospace-check`.

-## 12. Metrics-over-time tracking — OPEN
+## 12. Metrics-over-time tracking — RESOLVED

 **Depends on:** Task 11 (structured output).
 **Issue:** There is one metrics snapshot that gets overwritten. No history
@@ -283,6 +296,8 @@ Metrics history (5 snapshots):
 This provides the "metrics that improve over time" feedback loop the
 README envisions: process chapters → evaluate → see coverage grow (or
 flag regressions when a re-extraction reduces quality scores).
+**Fix applied:** `output/metrics/history.yaml` maintained by
+`markitect/infospace/history.py`.

 ---

@@ -296,7 +311,7 @@ be built once per evaluation run.
 See the methodology document for theoretical grounding, framework
 references, and the full metric definitions per concern.

-## 13. Entity metadata index — deterministic parsing layer — OPEN
+## 13. Entity metadata index — deterministic parsing layer — RESOLVED

 **Depends on:** Task 10 (schema compliance checker shares parsing logic).
 **Issue:** Several collection-level metrics (coverage matrix, FCA context,
@@ -324,8 +339,10 @@ class EntityMeta:
 Build an index of all entities at the start of each evaluation run.
 This index is the input for tasks 14, 16, and 18. Expose as
 `--index` CLI flag for inspection.
+**Fix applied:** `markitect/infospace/entity_parser.py` — `parse_entity_file()`
+and `parse_entity_directory()`. Used automatically by `--infospace-check`.

-## 14. Redundancy detection (Concern C1) — OPEN
+## 14. Redundancy detection (Concern C1) — RESOLVED

 **Depends on:** Task 13 (metadata index).
 **Methodology:** OOPS! P2 (synonymous classes) + embedding similarity +
@@ -357,8 +374,9 @@ dedup only checks slug collisions. There is no semantic overlap detection.
 - `intensional_conciseness`: `1 - redundancy_ratio`

 **CLI:** `--check-redundancy --provider <provider>`
+**Fix applied:** `markitect/infospace/checks/redundancy.py`. Exposed via `--infospace-check`.

-## 15. Coverage completeness (Concern C2) — OPEN
+## 15. Coverage completeness (Concern C2) — RESOLVED

 **Depends on:** Task 13 (metadata index).
 **Methodology:** SEQUAL completeness + FCA gap analysis + DSL competency
@@ -399,8 +417,9 @@ questions about the economic system.
 - `competency_coverage`: fraction of questions answerable

 **CLI:** `--check-coverage --provider <provider>`
+**Fix applied:** `markitect/infospace/checks/coverage.py`. Exposed via `--infospace-check`.

-## 16. Structural coherence (Concern C3) — OPEN
+## 16. Structural coherence (Concern C3) — RESOLVED

 **Depends on:** Task 13 (metadata index).
 **Methodology:** OntoQA relationship richness + graph connectivity +
@@ -440,8 +459,9 @@ between entities.
 - `cohesion_by_domain` / `coupling_across_domains`: scalars

 **CLI:** `--check-coherence --provider <provider>`
+**Fix applied:** `markitect/infospace/checks/coherence.py`. Exposed via `--infospace-check`.

-## 17. Definitional consistency (Concern C4) — OPEN
+## 17. Definitional consistency (Concern C4) — RESOLVED

 **Depends on:** Task 16 (relationship graph — the definitional dependency
 graph is a directed variant of the same structure).
@@ -479,8 +499,9 @@ entities but aren't.
 - `source_fidelity_score`: fraction passing source check

 **CLI:** `--check-consistency --provider <provider>`
+**Fix applied:** `markitect/infospace/checks/consistency.py`. Exposed via `--infospace-check`.

-## 18. Granularity balance (Concern C5) — OPEN
+## 18. Granularity balance (Concern C5) — RESOLVED

 **Depends on:** Task 13 (metadata index).
 **Methodology:** Keet granularity theory + OntoClean rigidity +
@@ -517,8 +538,9 @@ or whether some entities are too specific/general relative to their peers.
 - `split_candidates`: list of entities

 **CLI:** `--check-granularity --provider <provider>`
+**Fix applied:** `markitect/infospace/checks/granularity.py`. Exposed via `--infospace-check`.

-## 19. Unified collection evaluation command — OPEN
+## 19. Unified collection evaluation command — RESOLVED

 **Depends on:** Tasks 13-18.
 **Issue:** Running five separate `--check-*` commands is cumbersome and
@@ -537,6 +559,10 @@ runs all five checks in sequence, sharing infrastructure:
 Incremental mode: `--evaluate-collection --chapter <id>` re-evaluates
 only entities from that chapter plus pairwise checks involving them.

+**Fix applied:** `markitect/infospace/checks/orchestrator.py` + `--infospace-check`
+CLI flag. All five checks share the metadata index. Results recorded in
+`output/metrics/metrics.yaml` and `output/metrics/history.yaml`.
+
 Report a summary to stdout:

 ```
--- a/examples/infospace-with-history/process_chapters.py
+++ b/examples/infospace-with-history/process_chapters.py
@@ -487,14 +487,16 @@ class ChapterProcessor:
        if not prompt:
            return None

-        # Write compiled prompt for inspection
-        prompt_file = self._entities_dir() / f"{chapter_id}-prompt.md"
-        prompt_file.parent.mkdir(parents=True, exist_ok=True)
-        prompt_file.write_text(prompt)
-        print(f"        Prompt written to {prompt_file.relative_to(self.example_dir)}")
-
        view_file = self._entities_dir() / f"{chapter_id}-entities.md"

+        # Write compiled prompt only when no output exists yet (avoids dirty
+        # working tree on DB-only rebuilds — Task 5 fix)
+        prompt_file = self._entities_dir() / f"{chapter_id}-prompt.md"
+        if not (view_file.exists() and "{{ include" in view_file.read_text()):
+            prompt_file.parent.mkdir(parents=True, exist_ok=True)
+            prompt_file.write_text(prompt)
+            print(f"        Prompt written to {prompt_file.relative_to(self.example_dir)}")
+
        # ── PRIMARY: chapter view with transclusion already on disk ──
        if view_file.exists() and "{{ include" in view_file.read_text():
            content, entity_files = self._read_entities_from_view(chapter_id)
@@ -575,11 +577,14 @@ class ChapterProcessor:
        if not prompt:
            return None

-        prompt_file = self.example_dir / "output" / "mappings" / f"{chapter_id}-prompt.md"
-        prompt_file.write_text(prompt)
-        print(f"        Prompt written to {prompt_file.relative_to(self.example_dir)}")
-
        output_file = self.example_dir / "output" / "mappings" / f"{chapter_id}-mappings.md"
+        # Write compiled prompt only when output does not yet exist (Task 5 fix)
+        if not output_file.exists():
+            prompt_file = self.example_dir / "output" / "mappings" / f"{chapter_id}-prompt.md"
+            prompt_file.parent.mkdir(parents=True, exist_ok=True)
+            prompt_file.write_text(prompt)
+            print(f"        Prompt written to {prompt_file.relative_to(self.example_dir)}")
+
        if output_file.exists():
            content = output_file.read_text()
            self.store_output_artifact(
@@ -622,11 +627,14 @@ class ChapterProcessor:
        if not prompt:
            return None

-        prompt_file = self.example_dir / "output" / "analyses" / f"{chapter_id}-prompt.md"
-        prompt_file.write_text(prompt)
-        print(f"        Prompt written to {prompt_file.relative_to(self.example_dir)}")
-
        output_file = self.example_dir / "output" / "analyses" / f"{chapter_id}-analysis.md"
+        # Write compiled prompt only when output does not yet exist (Task 5 fix)
+        if not output_file.exists():
+            prompt_file = self.example_dir / "output" / "analyses" / f"{chapter_id}-prompt.md"
+            prompt_file.parent.mkdir(parents=True, exist_ok=True)
+            prompt_file.write_text(prompt)
+            print(f"        Prompt written to {prompt_file.relative_to(self.example_dir)}")
+
        if output_file.exists():
            content = output_file.read_text()
            self.store_output_artifact(
@@ -679,11 +687,14 @@ class ChapterProcessor:
        if not prompt:
            return None

-        prompt_file = self.example_dir / "output" / "metrics" / "metrics-prompt.md"
-        prompt_file.write_text(prompt)
-        print(f"  Prompt written to {prompt_file.relative_to(self.example_dir)}")
-
        output_file = self.example_dir / "output" / "metrics" / "metrics-report.md"
+        # Write compiled prompt only when output does not yet exist (Task 5 fix)
+        if not output_file.exists():
+            prompt_file = self.example_dir / "output" / "metrics" / "metrics-prompt.md"
+            prompt_file.parent.mkdir(parents=True, exist_ok=True)
+            prompt_file.write_text(prompt)
+            print(f"  Prompt written to {prompt_file.relative_to(self.example_dir)}")
+
        if output_file.exists():
            content = output_file.read_text()
            self.store_output_artifact(
@@ -709,6 +720,123 @@ class ChapterProcessor:
        print(f"  Awaiting output at: {output_file.relative_to(self.example_dir)}")
        return None

+    # ── Entity Evaluation (Task 9) ────────────────────────────────────
+
+    def _extract_quality_rubric(self) -> str:
+        """Extract the Quality Metrics section from the entity schema file."""
+        schema_file = self.example_dir / "schemas" / "economic-entity-schema-v1.0.md"
+        text = schema_file.read_text()
+        # Find the ## Quality Metrics section up to the next ## section
+        import re as _re
+        m = _re.search(
+            r"^## Quality Metrics\n(.*?)^## ",
+            text,
+            flags=_re.MULTILINE | _re.DOTALL,
+        )
+        if m:
+            return ("## Quality Metrics\n" + m.group(1)).strip()
+        return text  # fallback: whole schema
+
+    def _extract_source_chapter_from_entity(self, entity_text: str) -> str:
+        """Extract the Source Chapter field from an entity markdown file."""
+        import re as _re
+        m = _re.search(
+            r"^## Source Chapter\s*\n+(.+?)(?:\n\n|\n##|\Z)",
+            entity_text,
+            flags=_re.MULTILINE | _re.DOTALL,
+        )
+        if m:
+            return m.group(1).strip()
+        return "Unknown chapter"
+
+    def evaluate_entities(self, chapter_id: Optional[str] = None) -> None:
+        """Evaluate canonical entities using the evaluate-entity template.
+
+        If *chapter_id* is given, evaluates only entities introduced by that
+        chapter (determined from the chapter view file). Otherwise evaluates
+        all canonical entities.
+
+        Outputs are written to ``output/evaluations/<slug>-eval.md``.
+        Existing evaluation files are skipped (idempotent).
+        """
+        evaluations_dir = self.example_dir / "output" / "evaluations"
+        evaluations_dir.mkdir(parents=True, exist_ok=True)
+
+        # Determine which entity files to evaluate
+        if chapter_id:
+            view_file = self._entities_dir() / f"{chapter_id}-entities.md"
+            if not view_file.exists():
+                print(f"  No chapter view found for {chapter_id}")
+                return
+            _, entity_files = self._read_entities_from_view(chapter_id)
+            if not entity_files:
+                print(f"  No entities found for chapter {chapter_id}")
+                return
+            print(f"Evaluating {len(entity_files)} entities from {chapter_id}...")
+        else:
+            slugs = self._list_existing_entity_names()
+            entity_files = [(s, self._entities_dir() / f"{s}.md") for s in slugs]
+            print(f"Evaluating {len(entity_files)} canonical entities...")
+
+        if not entity_files:
+            print("  No entities to evaluate.")
+            return
+
+        # Shared context loaded once
+        quality_rubric = self._extract_quality_rubric()
+        self.bind_macro_artifact(self.spaces["guidelines"], "quality_rubric", quality_rubric)
+
+        done = 0
+        skipped = 0
+        failed = 0
+
+        for slug, entity_path in entity_files:
+            output_file = evaluations_dir / f"{slug}-eval.md"
+            if output_file.exists():
+                skipped += 1
+                continue
+
+            if not entity_path.exists():
+                print(f"  MISSING: {entity_path.name}")
+                failed += 1
+                continue
+
+            entity_text = entity_path.read_text()
+            source_chapter = self._extract_source_chapter_from_entity(entity_text)
+
+            # Bind per-entity macros
+            self.bind_macro_artifact(self.spaces["entities"], "entity_content", entity_text)
+            self.bind_macro_artifact(self.spaces["sources"], "source_chapter", source_chapter)
+
+            prompt = self.resolve_and_compile(
+                "evaluate-entity",
+                ["entities", "sources", "vsm-reference", "guidelines"],
+            )
+            if not prompt:
+                print(f"  FAILED to compile prompt for {slug}")
+                failed += 1
+                continue
+
+            # Write prompt only when output does not yet exist (Task 5 fix)
+            prompt_file = evaluations_dir / f"{slug}-eval-prompt.md"
+            if not output_file.exists():
+                prompt_file.write_text(prompt)
+
+            if not self.llm_adapter:
+                print(f"  {slug}: prompt written, awaiting manual evaluation")
+                done += 1
+                continue
+
+            print(f"  Evaluating: {slug}...")
+            content = self._execute_llm(prompt, output_file, f"eval:{slug}", max_tokens=1024)
+            if content:
+                done += 1
+            else:
+                failed += 1
+
+        total = done + skipped + failed
+        print(f"\nEvaluation complete: {done} done, {skipped} skipped (existing), {failed} failed — {total} total")
+
    # ── Chapter Processing ───────────────────────────────────────────

    def process_chapter(self, chapter_id: str, auto_commit: bool = True):
@@ -994,9 +1122,13 @@ def main():
                       help="Run collection-level quality checks (C1-C5)")
    group.add_argument("--infospace-viability", action="store_true",
                       help="Show viability dashboard")
+    group.add_argument("--evaluate", action="store_true",
+                       help="Evaluate entity quality using the evaluate-entity template")

    parser.add_argument("--reason", type=str, default=None,
                        help="Reason for archiving (used with --archive-entity)")
+    parser.add_argument("--eval-chapter", type=str, default=None, metavar="CHAPTER_ID",
+                        help="Limit --evaluate to entities from a specific chapter")
    parser.add_argument("--no-commit", action="store_true", help="Skip git commits")
    parser.add_argument(
        "--provider",
@@ -1064,6 +1196,9 @@ def main():
    elif args.infospace_viability:
        _run_infospace_viability(example_dir)
        return
+    elif args.evaluate:
+        processor.evaluate_entities(chapter_id=args.eval_chapter)
+        return

    processor.show_stats()
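
The Task 5 guard is repeated almost verbatim in the mapping, analysis, and metrics stages. As a rough illustration only, it could be factored into one helper. `write_prompt_if_output_missing` is a hypothetical name that does not exist in `process_chapters.py`, and the entity stage would still need its stricter check (the view file must exist and contain `{{ include`).

```python
from pathlib import Path


def write_prompt_if_output_missing(prompt_file: Path, output_file: Path, prompt: str) -> bool:
    """Write the compiled prompt unless the stage output already exists.

    Hypothetical helper (not part of the commit). Returns True when the
    prompt file was written and False when the write was skipped.
    """
    if output_file.exists():
        # Output is already on disk, so leave the prompt file untouched.
        # A DB-only rebuild then changes no tracked files.
        return False
    # First run for this stage: make sure the directory exists, then write.
    prompt_file.parent.mkdir(parents=True, exist_ok=True)
    prompt_file.write_text(prompt, encoding="utf-8")
    return True
```

Returning a boolean would let each stage print its "Prompt written to ..." message only when a write actually happened, which matches the behaviour of the inline guards.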

--- a/examples/infospace-with-history/schemas/economic-entity-schema-v1.0.md
+++ b/examples/infospace-with-history/schemas/economic-entity-schema-v1.0.md
@@ -39,6 +39,45 @@ this entity. Must be enclosed in quotation marks with chapter reference.
 How this entity is understood in modern economic theory, including
 any evolution in meaning since Smith's time.

+## Quality Metrics
+
+Used by the `evaluate-entity` prompt template to score each entity on five
+dimensions. Each dimension is scored 1–5, where 1 = very poor and 5 = excellent.
+
+### Definition Precision (1-5)
+Is the definition specific, non-circular, and clearly distinguishable from
+neighbouring concepts? A score of 5 means the definition uniquely identifies
+the concept without relying on terms that are themselves undefined within the
+infospace. A score of 1 means the definition is vague, tautological, or
+indistinguishable from another entity.
+
+### Source Grounding (1-5)
+Is the entity grounded in a specific, verifiable passage from the source text?
+A score of 5 means a citation is present, the cited chapter exists, and the
+definition accurately reflects the cited passage. A score of 1 means no
+citation is given or the definition contradicts the source.
+
+### Domain Placement (1-5)
+Is the economic domain assignment correct and specific? A score of 5 means
+the assigned domain (e.g., Production, Distribution) is the most precise
+fit and would not be improved by a different choice. A score of 1 means the
+domain is wrong, or "General Theory" is used when a more specific domain
+applies.
+
+### VSM Relevance (1-5)
+Does this entity connect meaningfully to at least one VSM system (S1–S5,
+recursion, variety, algedonic signals)? A score of 5 means the entity is
+directly mappable to a VSM concept with a clear structural rationale. A
+score of 1 means the entity has no discernible VSM connection and may be
+too granular or peripheral to the system model.
+
+### Explanatory Value (1-5)
+Does this entity contribute to explaining the economic system as a whole, or
+is it a restatement of another concept? A score of 5 means removing this
+entity would leave a meaningful gap in the infospace. A score of 1 means
+another entity already covers this ground, or the entity adds no
+explanatory power.
+
 ## Validation Rules

 1. The document MUST contain an H1 heading with the entity name.
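
In this schema the Quality Metrics section is followed by `## Validation Rules`, which is what the regex in `_extract_quality_rubric` depends on: it needs a later `## ` heading to stop at, and falls back to the whole schema otherwise. A minimal sketch of a variant that also works when the section is the last one in the file is shown here. `extract_section` is a hypothetical helper, not code from this commit.

```python
import re
from typing import Optional


def extract_section(markdown: str, heading: str) -> Optional[str]:
    """Return the H2 section named *heading*, including its H3 subsections.

    Hypothetical helper. The lookahead stops at the next line that starts
    with "## " or at the end of the text, so a trailing section also matches.
    """
    pattern = rf"^## {re.escape(heading)}\n(.*?)(?=^## |\Z)"
    match = re.search(pattern, markdown, flags=re.MULTILINE | re.DOTALL)
    if match is None:
        return None
    return f"## {heading}\n{match.group(1)}".strip()
```

Lines beginning with `### ` do not end the match, because the third character is `#` rather than a space, so the five dimension subsections stay inside the extracted rubric.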
--- a/examples/infospace-with-history/schemas/vsm-mapping-schema-v1.0.md
+++ b/examples/infospace-with-history/schemas/vsm-mapping-schema-v1.0.md
@@ -33,6 +33,25 @@ might not fit the VSM concept perfectly.
 Other VSM concepts this entity could plausibly map to,
 with brief rationale for each alternative.

+## Quality Metrics
+
+Used by the `evaluate-entity` prompt template when assessing mapping quality.
+Each dimension is scored 1–5, where 1 = very poor and 5 = excellent.
+
+### Rationale Rigour (1-5)
+Is the mapping justified with reference to Beer's VSM definitions, not just
+surface-level analogy? A score of 5 means the rationale cites specific VSM
+properties (e.g., "S2 attenuates variety between S1 units") and shows how
+the economic entity fulfils that role. A score of 1 means the rationale is
+a loose metaphor with no structural grounding.
+
+### Strength Calibration (1-5)
+Is the declared Mapping Strength (Strong, Moderate, Weak) consistent with
+the rationale given? A score of 5 means the declared strength matches the
+depth of correspondence described. A score of 1 means the strength is
+overclaimed (e.g., "Strong" for a tangential analogy) or underclaimed
+(e.g., "Weak" for a direct structural match).
+
 ## Validation Rules

 1. The document MUST contain an H1 heading in the format "Entity Name -> VSM Concept Name".
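
Both schemas score every dimension on the same 1-5 scale, so a deterministic range check on evaluation results is straightforward. The sketch below assumes scores arrive as a plain mapping from dimension name to integer. That input shape and the names `ENTITY_DIMENSIONS`, `MAPPING_DIMENSIONS`, and `validate_scores` are illustrative assumptions, since the commit does not define an output format for `evaluate-entity`.

```python
from typing import Dict, List, Sequence

# Dimension names copied from the two Quality Metrics sections.
ENTITY_DIMENSIONS = (
    "Definition Precision",
    "Source Grounding",
    "Domain Placement",
    "VSM Relevance",
    "Explanatory Value",
)
MAPPING_DIMENSIONS = ("Rationale Rigour", "Strength Calibration")


def validate_scores(scores: Dict[str, int], dimensions: Sequence[str]) -> List[str]:
    """Return a list of problems; an empty list means the scores are valid."""
    problems = []
    for name in dimensions:
        if name not in scores:
            problems.append(f"missing score: {name}")
        elif not isinstance(scores[name], int) or not 1 <= scores[name] <= 5:
            problems.append(f"out of range (1-5): {name}={scores[name]!r}")
    # Flag anything that is not a dimension defined by the schema.
    for name in scores:
        if name not in dimensions:
            problems.append(f"unknown dimension: {name}")
    return problems
```

A check like this needs no LLM call, in the same spirit as the schema compliance checker described in task 10.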