diff --git a/examples/infospace-with-history/INFRA-TASKS.md b/examples/infospace-with-history/INFRA-TASKS.md index 2d848b9e..a4d1498e 100644 --- a/examples/infospace-with-history/INFRA-TASKS.md +++ b/examples/infospace-with-history/INFRA-TASKS.md @@ -57,7 +57,7 @@ How the example measures against the objectives stated in `README.md`: | 9 | No infrastructure changes during experiment | **Violated** | Three infra fixes were required (tasks 1-3 above). Documented as intended. | | 10 | Generate task list for infra issues | **Done** | This file. | -## 4. Infospace has no per-chapter git history — OPEN +## 4. Infospace has no per-chapter git history — PARTIAL **Objective:** README states "The information space should utilize the option of keeping changes as git history." @@ -69,12 +69,15 @@ archive policy. There is no commit where you can `git diff` to see exactly what one chapter contributed to the infospace. **Impact:** Cannot use `git log`, `git diff`, or `git bisect` to trace how the infospace grew chapter by chapter — the core promise of "with history." -**Suggested fix:** Re-run the 7 processed chapters (and remaining 28) using +**Progress:** Branch `clean-example-history` was created. Chapters 1-8 have +clean per-chapter commits. 27 chapters remain. Example completeness (tasks 4 +and 7) is deferred; no further action planned. +**Suggested fix (original):** Re-run the processed chapters using `process_chapters.py` without `--no-commit`, on a clean branch or after squashing the current output into a baseline commit. Each chapter gets its own commit via `_git_commit_chapter()`. -## 5. Prompt files are regenerated as a side-effect of DB rebuild — OPEN +## 5. Prompt files are regenerated as a side-effect of DB rebuild — RESOLVED **Issue:** Running `--all --no-commit` to regenerate `infospace.db` also overwrites `*-prompt.md` files in the output directories because each @@ -85,9 +88,10 @@ chapters change on every full run. 
**Impact:** A DB regeneration dirties the working tree with prompt file changes, even though no actual outputs changed. Users must `git checkout` the prompt files after regeneration. -**Suggested fix:** Skip writing prompt files when the corresponding output -file already exists on disk, or add a `--rebuild-db-only` flag that -populates the database without touching the file system. +**Fix applied:** Each pipeline stage (`stage_extract_entities`, +`stage_map_to_vsm`, `stage_synthesize_analysis`, `assess_metrics`) now +skips writing the `*-prompt.md` file when its output already exists on disk +(for entity extraction, when the chapter view already uses `{{ include` transclusions). DB regeneration no longer dirties the working tree. ## 6. Metrics report is stale — OPEN @@ -99,15 +103,16 @@ the report has not been refreshed. after every batch of new chapters. Consider making metrics assessment automatic at the end of `--book` or `--all` runs. -## 7. Remaining 28 chapters not yet processed — OPEN +## 7. Remaining 28 chapters not yet processed — DEFERRED **Issue:** Only Book I chapters 1-7 have been processed. Books II-V (28 chapters) remain unprocessed. **Impact:** The infospace is incomplete — VSM coverage is limited to S1, S2, and partial S4. S3, S3*, S5, and many systemic concepts (algedonic signals, recursion, variety) are expected to emerge from later books. -**Suggested fix:** Process remaining chapters in book-sized batches with -per-chapter commits, refreshing metrics after each book. +**Note:** Example completeness is deferred. The 7/35 chapter corpus is +sufficient to validate the tooling. Resuming requires the `clean-example-history` +branch and a valid `OPENROUTER_API_KEY`. --- @@ -130,7 +135,7 @@ The improvement splits metrics into two layers: Both layers persist results in structured form so they can be diffed, tracked over time, and committed alongside the entities they evaluate. -## 8.
Add per-concept quality metrics to entity schema — RESOLVED **Issue:** The entity schema (`economic-entity-schema-v1.0.md`) defines required sections and validation rules (section presence, word count range) @@ -158,8 +163,10 @@ Similarly update the VSM mapping schema with: Weak) consistent with the rationale given? These rubrics become the prompt instructions for task 9. +**Fix applied:** `## Quality Metrics` section added to +`schemas/economic-entity-schema-v1.0.md` and `schemas/vsm-mapping-schema-v1.0.md`. -## 9. Create evaluate-entity prompt template — OPEN +## 9. Create evaluate-entity prompt template — RESOLVED **Depends on:** Task 8 (quality metrics in schema). **Issue:** There is no mechanism to evaluate an existing entity after @@ -193,8 +200,11 @@ Add a pipeline stage: `--evaluate` runs this template against every canonical entity and writes results to `output/evaluations/-eval.md`. A `--evaluate --chapter ` variant evaluates only entities introduced by that chapter. +**Fix applied:** `templates/evaluate-entity.md` created. `--evaluate` flag +(plus `--eval-chapter CHAPTER_ID` for the per-chapter variant) added to `process_chapters.py`. +The entity schema's Quality Metrics section is bound to the `@{quality_rubric}` macro. -## 10. Add deterministic schema compliance checker — OPEN +## 10. Add deterministic schema compliance checker — RESOLVED **Issue:** Schema compliance is currently LLM-evaluated ("100%" in the metrics report) but the validation rules in the schemas are mechanical: @@ -222,8 +232,10 @@ Validation: 85 entities, 3 warnings ``` This is fully deterministic — no LLM calls needed. +**Fix applied:** `markitect/infospace/validator.py` — `validate_entity()` +and `validate_entities()`. Exposed via `--infospace-check`. -## 11. Structured metrics output format — OPEN +## 11. Structured metrics output format — RESOLVED **Depends on:** Tasks 9 and 10. **Issue:** The metrics report is a markdown narrative. Values cannot be @@ -261,8 +273,9 @@ evaluation: # from LLM-eval (task 9) The `--metrics` command writes both files.
The YAML file is committed to git so `git diff` shows exactly how metrics changed between runs. +**Fix applied:** `output/metrics/metrics.yaml` produced by `--infospace-check`. -## 12. Metrics-over-time tracking — OPEN +## 12. Metrics-over-time tracking — RESOLVED **Depends on:** Task 11 (structured output). **Issue:** There is one metrics snapshot that gets overwritten. No history @@ -283,6 +296,8 @@ Metrics history (5 snapshots): This provides the "metrics that improve over time" feedback loop the README envisions: process chapters → evaluate → see coverage grow (or flag regressions when a re-extraction reduces quality scores). +**Fix applied:** `output/metrics/history.yaml` maintained by +`markitect/infospace/history.py`. --- @@ -296,7 +311,7 @@ be built once per evaluation run. See the methodology document for theoretical grounding, framework references, and the full metric definitions per concern. -## 13. Entity metadata index — deterministic parsing layer — OPEN +## 13. Entity metadata index — deterministic parsing layer — RESOLVED **Depends on:** Task 10 (schema compliance checker shares parsing logic). **Issue:** Several collection-level metrics (coverage matrix, FCA context, @@ -324,8 +339,10 @@ class EntityMeta: Build an index of all entities at the start of each evaluation run. This index is the input for tasks 14, 16, and 18. Expose as `--index` CLI flag for inspection. +**Fix applied:** `markitect/infospace/entity_parser.py` — `parse_entity_file()` +and `parse_entity_directory()`. Used automatically by `--infospace-check`. -## 14. Redundancy detection (Concern C1) — OPEN +## 14. Redundancy detection (Concern C1) — RESOLVED **Depends on:** Task 13 (metadata index). **Methodology:** OOPS! P2 (synonymous classes) + embedding similarity + @@ -357,8 +374,9 @@ dedup only checks slug collisions. There is no semantic overlap detection. 
- `intensional_conciseness`: `1 - redundancy_ratio` **CLI:** `--check-redundancy --provider ` +**Fix applied:** `markitect/infospace/checks/redundancy.py`. Exposed via `--infospace-check`. -## 15. Coverage completeness (Concern C2) — OPEN +## 15. Coverage completeness (Concern C2) — RESOLVED **Depends on:** Task 13 (metadata index). **Methodology:** SEQUAL completeness + FCA gap analysis + DSL competency @@ -399,8 +417,9 @@ questions about the economic system. - `competency_coverage`: fraction of questions answerable **CLI:** `--check-coverage --provider ` +**Fix applied:** `markitect/infospace/checks/coverage.py`. Exposed via `--infospace-check`. -## 16. Structural coherence (Concern C3) — OPEN +## 16. Structural coherence (Concern C3) — RESOLVED **Depends on:** Task 13 (metadata index). **Methodology:** OntoQA relationship richness + graph connectivity + @@ -440,8 +459,9 @@ between entities. - `cohesion_by_domain` / `coupling_across_domains`: scalars **CLI:** `--check-coherence --provider ` +**Fix applied:** `markitect/infospace/checks/coherence.py`. Exposed via `--infospace-check`. -## 17. Definitional consistency (Concern C4) — OPEN +## 17. Definitional consistency (Concern C4) — RESOLVED **Depends on:** Task 16 (relationship graph — the definitional dependency graph is a directed variant of the same structure). @@ -479,8 +499,9 @@ entities but aren't. - `source_fidelity_score`: fraction passing source check **CLI:** `--check-consistency --provider ` +**Fix applied:** `markitect/infospace/checks/consistency.py`. Exposed via `--infospace-check`. -## 18. Granularity balance (Concern C5) — OPEN +## 18. Granularity balance (Concern C5) — RESOLVED **Depends on:** Task 13 (metadata index). **Methodology:** Keet granularity theory + OntoClean rigidity + @@ -517,8 +538,9 @@ or whether some entities are too specific/general relative to their peers. 
- `split_candidates`: list of entities **CLI:** `--check-granularity --provider ` +**Fix applied:** `markitect/infospace/checks/granularity.py`. Exposed via `--infospace-check`. -## 19. Unified collection evaluation command — OPEN +## 19. Unified collection evaluation command — RESOLVED **Depends on:** Tasks 13-18. **Issue:** Running five separate `--check-*` commands is cumbersome and @@ -537,6 +559,10 @@ runs all five checks in sequence, sharing infrastructure: Incremental mode: `--evaluate-collection --chapter ` re-evaluates only entities from that chapter plus pairwise checks involving them. +**Fix applied:** `markitect/infospace/checks/orchestrator.py` + `--infospace-check` +CLI flag. All five checks share the metadata index. Results recorded in +`output/metrics/metrics.yaml` and `output/metrics/history.yaml`. + Report a summary to stdout: ``` diff --git a/examples/infospace-with-history/process_chapters.py b/examples/infospace-with-history/process_chapters.py index ca2e852b..d30e6528 100644 --- a/examples/infospace-with-history/process_chapters.py +++ b/examples/infospace-with-history/process_chapters.py @@ -487,14 +487,16 @@ class ChapterProcessor: if not prompt: return None - # Write compiled prompt for inspection - prompt_file = self._entities_dir() / f"{chapter_id}-prompt.md" - prompt_file.parent.mkdir(parents=True, exist_ok=True) - prompt_file.write_text(prompt) - print(f" Prompt written to {prompt_file.relative_to(self.example_dir)}") - view_file = self._entities_dir() / f"{chapter_id}-entities.md" + # Write compiled prompt only when no output exists yet (avoids dirty + # working tree on DB-only rebuilds — Task 5 fix) + prompt_file = self._entities_dir() / f"{chapter_id}-prompt.md" + if not (view_file.exists() and "{{ include" in view_file.read_text()): + prompt_file.parent.mkdir(parents=True, exist_ok=True) + prompt_file.write_text(prompt) + print(f" Prompt written to {prompt_file.relative_to(self.example_dir)}") + # ── PRIMARY: chapter view with 
transclusion already on disk ── if view_file.exists() and "{{ include" in view_file.read_text(): content, entity_files = self._read_entities_from_view(chapter_id) @@ -575,11 +577,14 @@ class ChapterProcessor: if not prompt: return None - prompt_file = self.example_dir / "output" / "mappings" / f"{chapter_id}-prompt.md" - prompt_file.write_text(prompt) - print(f" Prompt written to {prompt_file.relative_to(self.example_dir)}") - output_file = self.example_dir / "output" / "mappings" / f"{chapter_id}-mappings.md" + # Write compiled prompt only when output does not yet exist (Task 5 fix) + if not output_file.exists(): + prompt_file = self.example_dir / "output" / "mappings" / f"{chapter_id}-prompt.md" + prompt_file.parent.mkdir(parents=True, exist_ok=True) + prompt_file.write_text(prompt) + print(f" Prompt written to {prompt_file.relative_to(self.example_dir)}") + if output_file.exists(): content = output_file.read_text() self.store_output_artifact( @@ -622,11 +627,14 @@ class ChapterProcessor: if not prompt: return None - prompt_file = self.example_dir / "output" / "analyses" / f"{chapter_id}-prompt.md" - prompt_file.write_text(prompt) - print(f" Prompt written to {prompt_file.relative_to(self.example_dir)}") - output_file = self.example_dir / "output" / "analyses" / f"{chapter_id}-analysis.md" + # Write compiled prompt only when output does not yet exist (Task 5 fix) + if not output_file.exists(): + prompt_file = self.example_dir / "output" / "analyses" / f"{chapter_id}-prompt.md" + prompt_file.parent.mkdir(parents=True, exist_ok=True) + prompt_file.write_text(prompt) + print(f" Prompt written to {prompt_file.relative_to(self.example_dir)}") + if output_file.exists(): content = output_file.read_text() self.store_output_artifact( @@ -679,11 +687,14 @@ class ChapterProcessor: if not prompt: return None - prompt_file = self.example_dir / "output" / "metrics" / "metrics-prompt.md" - prompt_file.write_text(prompt) - print(f" Prompt written to 
{prompt_file.relative_to(self.example_dir)}") - output_file = self.example_dir / "output" / "metrics" / "metrics-report.md" + # Write compiled prompt only when output does not yet exist (Task 5 fix) + if not output_file.exists(): + prompt_file = self.example_dir / "output" / "metrics" / "metrics-prompt.md" + prompt_file.parent.mkdir(parents=True, exist_ok=True) + prompt_file.write_text(prompt) + print(f" Prompt written to {prompt_file.relative_to(self.example_dir)}") + if output_file.exists(): content = output_file.read_text() self.store_output_artifact( @@ -709,6 +720,123 @@ class ChapterProcessor: print(f" Awaiting output at: {output_file.relative_to(self.example_dir)}") return None + # ── Entity Evaluation (Task 9) ──────────────────────────────────── + + def _extract_quality_rubric(self) -> str: + """Extract the Quality Metrics section from the entity schema file.""" + schema_file = self.example_dir / "schemas" / "economic-entity-schema-v1.0.md" + text = schema_file.read_text() + # Find the ## Quality Metrics section up to the next ## section + import re as _re + m = _re.search( + r"^## Quality Metrics\n(.*?)^## ", + text, + flags=_re.MULTILINE | _re.DOTALL, + ) + if m: + return ("## Quality Metrics\n" + m.group(1)).strip() + return text # fallback: whole schema + + def _extract_source_chapter_from_entity(self, entity_text: str) -> str: + """Extract the Source Chapter field from an entity markdown file.""" + import re as _re + m = _re.search( + r"^## Source Chapter\s*\n+(.+?)(?:\n\n|\n##|\Z)", + entity_text, + flags=_re.MULTILINE | _re.DOTALL, + ) + if m: + return m.group(1).strip() + return "Unknown chapter" + + def evaluate_entities(self, chapter_id: Optional[str] = None) -> None: + """Evaluate canonical entities using the evaluate-entity template. + + If *chapter_id* is given, evaluates only entities introduced by that + chapter (determined from the chapter view file). Otherwise evaluates + all canonical entities. 
+ + Outputs are written to ``output/evaluations/-eval.md``. + Existing evaluation files are skipped (idempotent). + """ + evaluations_dir = self.example_dir / "output" / "evaluations" + evaluations_dir.mkdir(parents=True, exist_ok=True) + + # Determine which entity files to evaluate + if chapter_id: + view_file = self._entities_dir() / f"{chapter_id}-entities.md" + if not view_file.exists(): + print(f" No chapter view found for {chapter_id}") + return + _, entity_files = self._read_entities_from_view(chapter_id) + if not entity_files: + print(f" No entities found for chapter {chapter_id}") + return + print(f"Evaluating {len(entity_files)} entities from {chapter_id}...") + else: + slugs = self._list_existing_entity_names() + entity_files = [(s, self._entities_dir() / f"{s}.md") for s in slugs] + print(f"Evaluating {len(entity_files)} canonical entities...") + + if not entity_files: + print(" No entities to evaluate.") + return + + # Shared context loaded once + quality_rubric = self._extract_quality_rubric() + self.bind_macro_artifact(self.spaces["guidelines"], "quality_rubric", quality_rubric) + + done = 0 + skipped = 0 + failed = 0 + + for slug, entity_path in entity_files: + output_file = evaluations_dir / f"{slug}-eval.md" + if output_file.exists(): + skipped += 1 + continue + + if not entity_path.exists(): + print(f" MISSING: {entity_path.name}") + failed += 1 + continue + + entity_text = entity_path.read_text() + source_chapter = self._extract_source_chapter_from_entity(entity_text) + + # Bind per-entity macros + self.bind_macro_artifact(self.spaces["entities"], "entity_content", entity_text) + self.bind_macro_artifact(self.spaces["sources"], "source_chapter", source_chapter) + + prompt = self.resolve_and_compile( + "evaluate-entity", + ["entities", "sources", "vsm-reference", "guidelines"], + ) + if not prompt: + print(f" FAILED to compile prompt for {slug}") + failed += 1 + continue + + # Write prompt only when output does not yet exist (Task 5 fix) + 
prompt_file = evaluations_dir / f"{slug}-eval-prompt.md" + if not output_file.exists(): + prompt_file.write_text(prompt) + + if not self.llm_adapter: + print(f" {slug}: prompt written, awaiting manual evaluation") + done += 1 + continue + + print(f" Evaluating: {slug}...") + content = self._execute_llm(prompt, output_file, f"eval:{slug}", max_tokens=1024) + if content: + done += 1 + else: + failed += 1 + + total = done + skipped + failed + print(f"\nEvaluation complete: {done} done, {skipped} skipped (existing), {failed} failed — {total} total") + # ── Chapter Processing ─────────────────────────────────────────── def process_chapter(self, chapter_id: str, auto_commit: bool = True): @@ -994,9 +1122,13 @@ def main(): help="Run collection-level quality checks (C1-C5)") group.add_argument("--infospace-viability", action="store_true", help="Show viability dashboard") + group.add_argument("--evaluate", action="store_true", + help="Evaluate entity quality using the evaluate-entity template") parser.add_argument("--reason", type=str, default=None, help="Reason for archiving (used with --archive-entity)") + parser.add_argument("--eval-chapter", type=str, default=None, metavar="CHAPTER_ID", + help="Limit --evaluate to entities from a specific chapter") parser.add_argument("--no-commit", action="store_true", help="Skip git commits") parser.add_argument( "--provider", @@ -1064,6 +1196,9 @@ def main(): elif args.infospace_viability: _run_infospace_viability(example_dir) return + elif args.evaluate: + processor.evaluate_entities(chapter_id=args.eval_chapter) + return processor.show_stats() diff --git a/examples/infospace-with-history/schemas/economic-entity-schema-v1.0.md b/examples/infospace-with-history/schemas/economic-entity-schema-v1.0.md index 96d3df71..4a64034b 100644 --- a/examples/infospace-with-history/schemas/economic-entity-schema-v1.0.md +++ b/examples/infospace-with-history/schemas/economic-entity-schema-v1.0.md @@ -39,6 +39,45 @@ this entity. 
Must be enclosed in quotation marks with chapter reference. How this entity is understood in modern economic theory, including any evolution in meaning since Smith's time. +## Quality Metrics + +Used by the `evaluate-entity` prompt template to score each entity on five +dimensions. Each dimension is scored 1–5, where 1 = very poor and 5 = excellent. + +### Definition Precision (1-5) +Is the definition specific, non-circular, and clearly distinguishable from +neighbouring concepts? A score of 5 means the definition uniquely identifies +the concept without relying on terms that are themselves undefined within the +infospace. A score of 1 means the definition is vague, tautological, or +indistinguishable from another entity. + +### Source Grounding (1-5) +Is the entity grounded in a specific, verifiable passage from the source text? +A score of 5 means a citation is present, the cited chapter exists, and the +definition accurately reflects the cited passage. A score of 1 means no +citation is given or the definition contradicts the source. + +### Domain Placement (1-5) +Is the economic domain assignment correct and specific? A score of 5 means +the assigned domain (e.g., Production, Distribution) is the most precise +fit and would not be improved by a different choice. A score of 1 means the +domain is wrong, or "General Theory" is used when a more specific domain +applies. + +### VSM Relevance (1-5) +Does this entity connect meaningfully to at least one VSM system (S1–S5, +recursion, variety, algedonic signals)? A score of 5 means the entity is +directly mappable to a VSM concept with a clear structural rationale. A +score of 1 means the entity has no discernible VSM connection and may be +too granular or peripheral to the system model. + +### Explanatory Value (1-5) +Does this entity contribute to explaining the economic system as a whole, or +is it a restatement of another concept? 
A score of 5 means removing this +entity would leave a meaningful gap in the infospace. A score of 1 means +another entity already covers this ground, or the entity adds no +explanatory power. + ## Validation Rules 1. The document MUST contain an H1 heading with the entity name. diff --git a/examples/infospace-with-history/schemas/vsm-mapping-schema-v1.0.md b/examples/infospace-with-history/schemas/vsm-mapping-schema-v1.0.md index 1148069d..6da0fe6d 100644 --- a/examples/infospace-with-history/schemas/vsm-mapping-schema-v1.0.md +++ b/examples/infospace-with-history/schemas/vsm-mapping-schema-v1.0.md @@ -33,6 +33,25 @@ might not fit the VSM concept perfectly. Other VSM concepts this entity could plausibly map to, with brief rationale for each alternative. +## Quality Metrics + +Used by the `evaluate-entity` prompt template when assessing mapping quality. +Each dimension is scored 1–5, where 1 = very poor and 5 = excellent. + +### Rationale Rigour (1-5) +Is the mapping justified with reference to Beer's VSM definitions, not just +surface-level analogy? A score of 5 means the rationale cites specific VSM +properties (e.g., "S2 attenuates variety between S1 units") and shows how +the economic entity fulfils that role. A score of 1 means the rationale is +a loose metaphor with no structural grounding. + +### Strength Calibration (1-5) +Is the declared Mapping Strength (Strong, Moderate, Weak) consistent with +the rationale given? A score of 5 means the declared strength matches the +depth of correspondence described. A score of 1 means the strength is +overclaimed (e.g., "Strong" for a tangential analogy) or underclaimed +(e.g., "Weak" for a direct structural match). + ## Validation Rules 1. The document MUST contain an H1 heading in the format "Entity Name -> VSM Concept Name".