fix(example): skip prompt writes when output exists, add quality rubrics

INFRA-TASKS #5 — process_chapters.py now skips writing *-prompt.md files
when the corresponding output file already exists on disk. DB-only rebuilds
no longer dirty the working tree with unchanged prompt content.

INFRA-TASKS #8 — Added '## Quality Metrics' section to the entity and VSM
mapping schemas, defining the five evaluation dimensions (Definition Precision,
Source Grounding, Domain Placement, VSM Relevance, Explanatory Value) with
1–5 rubrics used by the evaluate-entity template.

Also updated INFRA-TASKS.md to reflect current resolution status for tasks
4–19 across S2 and S3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-23 06:04:09 +01:00
parent dfab3d598b
commit fa27572f43
4 changed files with 258 additions and 39 deletions


@@ -57,7 +57,7 @@ How the example measures against the objectives stated in `README.md`:
| 9 | No infrastructure changes during experiment | **Violated** | Three infra fixes were required (tasks 1-3 above). Documented as intended. |
| 10 | Generate task list for infra issues | **Done** | This file. |
## 4. Infospace has no per-chapter git history — OPEN
## 4. Infospace has no per-chapter git history — PARTIAL
**Objective:** README states "The information space should utilize the option
of keeping changes as git history."
@@ -69,12 +69,15 @@ archive policy. There is no commit where you can `git diff` to see exactly
what one chapter contributed to the infospace.
**Impact:** Cannot use `git log`, `git diff`, or `git bisect` to trace how
the infospace grew chapter by chapter — the core promise of "with history."
**Suggested fix:** Re-run the 7 processed chapters (and remaining 28) using
**Progress:** Branch `clean-example-history` was created. Chapters 1-8 have
clean per-chapter commits. 27 chapters remain. Example completeness (tasks 4
and 7) is deferred; no further action planned.
**Suggested fix (original):** Re-run the processed chapters using
`process_chapters.py` without `--no-commit`, on a clean branch or after
squashing the current output into a baseline commit. Each chapter gets its
own commit via `_git_commit_chapter()`.
## 5. Prompt files are regenerated as a side-effect of DB rebuild — OPEN
## 5. Prompt files are regenerated as a side-effect of DB rebuild — RESOLVED
**Issue:** Running `--all --no-commit` to regenerate `infospace.db` also
overwrites `*-prompt.md` files in the output directories because each
@@ -85,9 +88,10 @@ chapters change on every full run.
**Impact:** A DB regeneration dirties the working tree with prompt file
changes, even though no actual outputs changed. Users must `git checkout`
the prompt files after regeneration.
**Suggested fix:** Skip writing prompt files when the corresponding output
file already exists on disk, or add a `--rebuild-db-only` flag that
populates the database without touching the file system.
**Fix applied:** Each pipeline stage (`stage_extract_entities`,
`stage_map_to_vsm`, `stage_synthesize_analysis`, `assess_metrics`) now
skips writing the `*-prompt.md` file when the corresponding output file
already exists on disk. DB regeneration no longer dirties the working tree.
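A minimal sketch of the guard described in this fix, assuming a single helper shared by the stages. The name `write_prompt_unless_output_exists` and the demo file names are hypothetical; the diff inlines the check in each stage, and the entity stage additionally requires the chapter view to contain a `{{ include` transclusion before it skips the write.

```python
import tempfile
from pathlib import Path


def write_prompt_unless_output_exists(
    prompt_file: Path, output_file: Path, prompt: str
) -> bool:
    """Write the compiled prompt only when the stage output is missing.

    Returns True if the prompt file was written, False if it was skipped.
    (Hypothetical helper; the commit repeats this check inline per stage.)
    """
    if output_file.exists():
        # Output already on disk: leave the prompt file untouched so a
        # DB-only rebuild does not dirty the working tree.
        return False
    prompt_file.parent.mkdir(parents=True, exist_ok=True)
    prompt_file.write_text(prompt)
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp) / "output" / "mappings"
        prompt_path = base / "demo-prompt.md"      # hypothetical file names
        output_path = base / "demo-mappings.md"
        print(write_prompt_unless_output_exists(prompt_path, output_path, "v1"))
        output_path.write_text("stage output")
        print(write_prompt_unless_output_exists(prompt_path, output_path, "v2"))
```

The second call leaves the prompt file at its first content, which is the behaviour the fix relies on to keep `git status` clean after `--all --no-commit`.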
## 6. Metrics report is stale — OPEN
@@ -99,15 +103,16 @@ the report has not been refreshed.
after every batch of new chapters. Consider making metrics assessment
automatic at the end of `--book` or `--all` runs.
## 7. Remaining 28 chapters not yet processed — OPEN
## 7. Remaining 28 chapters not yet processed — DEFERRED
**Issue:** Only Book I chapters 1-7 have been processed. Books II-V
(28 chapters) remain unprocessed.
**Impact:** The infospace is incomplete — VSM coverage is limited to S1,
S2, and partial S4. S3, S3*, S5, and many systemic concepts (algedonic
signals, recursion, variety) are expected to emerge from later books.
**Suggested fix:** Process remaining chapters in book-sized batches with
per-chapter commits, refreshing metrics after each book.
**Note:** Example completeness is deferred. The 7/35 chapter corpus is
sufficient to validate the tooling. Resuming requires the `clean-example-history`
branch and a valid `OPENROUTER_API_KEY`.
---
@@ -130,7 +135,7 @@ The improvement splits metrics into two layers:
Both layers persist results in structured form so they can be diffed,
tracked over time, and committed alongside the entities they evaluate.
## 8. Add per-concept quality metrics to entity schema — OPEN
## 8. Add per-concept quality metrics to entity schema — RESOLVED
**Issue:** The entity schema (`economic-entity-schema-v1.0.md`) defines
required sections and validation rules (section presence, word count range)
@@ -158,8 +163,10 @@ Similarly update the VSM mapping schema with:
Weak) consistent with the rationale given?
These rubrics become the prompt instructions for task 9.
**Fix applied:** `## Quality Metrics` section added to
`schemas/economic-entity-schema-v1.0.md` and `schemas/vsm-mapping-schema-v1.0.md`.
## 9. Create evaluate-entity prompt template — OPEN
## 9. Create evaluate-entity prompt template — RESOLVED
**Depends on:** Task 8 (quality metrics in schema).
**Issue:** There is no mechanism to evaluate an existing entity after
@@ -193,8 +200,11 @@ Add a pipeline stage: `--evaluate` runs this template against every
canonical entity and writes results to `output/evaluations/<slug>-eval.md`.
A `--evaluate --chapter <id>` variant evaluates only entities introduced
by that chapter.
**Fix applied:** `templates/evaluate-entity.md` created. `--evaluate`
flag added to `process_chapters.py`. Reads `@{quality_rubric}` from the
entity schema's Quality Metrics section.
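For orientation, the CLI wiring added in this commit can be sketched with `argparse` as below. In the diff the chapter filter is exposed as `--eval-chapter` rather than `--chapter`. Treating the option group as mutually exclusive is an assumption, and `build_parser` and the chapter id `ch01` are hypothetical names used only for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Reduced stand-in for the parser in process_chapters.py (assumed shape)."""
    parser = argparse.ArgumentParser(prog="process_chapters.py")
    # Assumption: the real `group` is a mutually exclusive group of modes.
    group = parser.add_mutually_exclusive_group()
    group.add_argument(
        "--evaluate",
        action="store_true",
        help="Evaluate entity quality using the evaluate-entity template",
    )
    parser.add_argument(
        "--eval-chapter",
        type=str,
        default=None,
        metavar="CHAPTER_ID",
        help="Limit --evaluate to entities from a specific chapter",
    )
    return parser


if __name__ == "__main__":
    # "ch01" is a placeholder chapter id, not taken from the repository.
    args = build_parser().parse_args(["--evaluate", "--eval-chapter", "ch01"])
    print(args.evaluate, args.eval_chapter)
```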
## 10. Add deterministic schema compliance checker — OPEN
## 10. Add deterministic schema compliance checker — RESOLVED
**Issue:** Schema compliance is currently LLM-evaluated ("100%" in the
metrics report) but the validation rules in the schemas are mechanical:
@@ -222,8 +232,10 @@ Validation: 85 entities, 3 warnings
```
This is fully deterministic — no LLM calls needed.
**Fix applied:** `markitect/infospace/validator.py`: `validate_entity()`
and `validate_entities()`. Exposed via `--infospace-check`.
## 11. Structured metrics output format — OPEN
## 11. Structured metrics output format — RESOLVED
**Depends on:** Tasks 9 and 10.
**Issue:** The metrics report is a markdown narrative. Values cannot be
@@ -261,8 +273,9 @@ evaluation: # from LLM-eval (task 9)
The `--metrics` command writes both files. The YAML file is committed
to git so `git diff` shows exactly how metrics changed between runs.
**Fix applied:** `output/metrics/metrics.yaml` produced by `--infospace-check`.
## 12. Metrics-over-time tracking — OPEN
## 12. Metrics-over-time tracking — RESOLVED
**Depends on:** Task 11 (structured output).
**Issue:** There is one metrics snapshot that gets overwritten. No history
@@ -283,6 +296,8 @@ Metrics history (5 snapshots):
This provides the "metrics that improve over time" feedback loop the
README envisions: process chapters → evaluate → see coverage grow (or
flag regressions when a re-extraction reduces quality scores).
**Fix applied:** `output/metrics/history.yaml` maintained by
`markitect/infospace/history.py`.
---
@@ -296,7 +311,7 @@ be built once per evaluation run.
See the methodology document for theoretical grounding, framework
references, and the full metric definitions per concern.
## 13. Entity metadata index — deterministic parsing layer — OPEN
## 13. Entity metadata index — deterministic parsing layer — RESOLVED
**Depends on:** Task 10 (schema compliance checker shares parsing logic).
**Issue:** Several collection-level metrics (coverage matrix, FCA context,
@@ -324,8 +339,10 @@ class EntityMeta:
Build an index of all entities at the start of each evaluation run.
This index is the input for tasks 14, 16, and 18. Expose as
`--index` CLI flag for inspection.
**Fix applied:** `markitect/infospace/entity_parser.py`: `parse_entity_file()`
and `parse_entity_directory()`. Used automatically by `--infospace-check`.
## 14. Redundancy detection (Concern C1) — OPEN
## 14. Redundancy detection (Concern C1) — RESOLVED
**Depends on:** Task 13 (metadata index).
**Methodology:** OOPS! P2 (synonymous classes) + embedding similarity +
@@ -357,8 +374,9 @@ dedup only checks slug collisions. There is no semantic overlap detection.
- `intensional_conciseness`: `1 - redundancy_ratio`
**CLI:** `--check-redundancy --provider <provider>`
**Fix applied:** `markitect/infospace/checks/redundancy.py`. Exposed via `--infospace-check`.
## 15. Coverage completeness (Concern C2) — OPEN
## 15. Coverage completeness (Concern C2) — RESOLVED
**Depends on:** Task 13 (metadata index).
**Methodology:** SEQUAL completeness + FCA gap analysis + DSL competency
@@ -399,8 +417,9 @@ questions about the economic system.
- `competency_coverage`: fraction of questions answerable
**CLI:** `--check-coverage --provider <provider>`
**Fix applied:** `markitect/infospace/checks/coverage.py`. Exposed via `--infospace-check`.
## 16. Structural coherence (Concern C3) — OPEN
## 16. Structural coherence (Concern C3) — RESOLVED
**Depends on:** Task 13 (metadata index).
**Methodology:** OntoQA relationship richness + graph connectivity +
@@ -440,8 +459,9 @@ between entities.
- `cohesion_by_domain` / `coupling_across_domains`: scalars
**CLI:** `--check-coherence --provider <provider>`
**Fix applied:** `markitect/infospace/checks/coherence.py`. Exposed via `--infospace-check`.
## 17. Definitional consistency (Concern C4) — OPEN
## 17. Definitional consistency (Concern C4) — RESOLVED
**Depends on:** Task 16 (relationship graph — the definitional dependency
graph is a directed variant of the same structure).
@@ -479,8 +499,9 @@ entities but aren't.
- `source_fidelity_score`: fraction passing source check
**CLI:** `--check-consistency --provider <provider>`
**Fix applied:** `markitect/infospace/checks/consistency.py`. Exposed via `--infospace-check`.
## 18. Granularity balance (Concern C5) — OPEN
## 18. Granularity balance (Concern C5) — RESOLVED
**Depends on:** Task 13 (metadata index).
**Methodology:** Keet granularity theory + OntoClean rigidity +
@@ -517,8 +538,9 @@ or whether some entities are too specific/general relative to their peers.
- `split_candidates`: list of entities
**CLI:** `--check-granularity --provider <provider>`
**Fix applied:** `markitect/infospace/checks/granularity.py`. Exposed via `--infospace-check`.
## 19. Unified collection evaluation command — OPEN
## 19. Unified collection evaluation command — RESOLVED
**Depends on:** Tasks 13-18.
**Issue:** Running five separate `--check-*` commands is cumbersome and
@@ -537,6 +559,10 @@ runs all five checks in sequence, sharing infrastructure:
Incremental mode: `--evaluate-collection --chapter <id>` re-evaluates
only entities from that chapter plus pairwise checks involving them.
**Fix applied:** `markitect/infospace/checks/orchestrator.py` + `--infospace-check`
CLI flag. All five checks share the metadata index. Results recorded in
`output/metrics/metrics.yaml` and `output/metrics/history.yaml`.
Report a summary to stdout:
```


@@ -487,14 +487,16 @@ class ChapterProcessor:
if not prompt:
return None
# Write compiled prompt for inspection
prompt_file = self._entities_dir() / f"{chapter_id}-prompt.md"
prompt_file.parent.mkdir(parents=True, exist_ok=True)
prompt_file.write_text(prompt)
print(f" Prompt written to {prompt_file.relative_to(self.example_dir)}")
view_file = self._entities_dir() / f"{chapter_id}-entities.md"
# Write compiled prompt only when no output exists yet (avoids dirty
# working tree on DB-only rebuilds — Task 5 fix)
prompt_file = self._entities_dir() / f"{chapter_id}-prompt.md"
if not (view_file.exists() and "{{ include" in view_file.read_text()):
prompt_file.parent.mkdir(parents=True, exist_ok=True)
prompt_file.write_text(prompt)
print(f" Prompt written to {prompt_file.relative_to(self.example_dir)}")
# ── PRIMARY: chapter view with transclusion already on disk ──
if view_file.exists() and "{{ include" in view_file.read_text():
content, entity_files = self._read_entities_from_view(chapter_id)
@@ -575,11 +577,14 @@ class ChapterProcessor:
if not prompt:
return None
prompt_file = self.example_dir / "output" / "mappings" / f"{chapter_id}-prompt.md"
prompt_file.write_text(prompt)
print(f" Prompt written to {prompt_file.relative_to(self.example_dir)}")
output_file = self.example_dir / "output" / "mappings" / f"{chapter_id}-mappings.md"
# Write compiled prompt only when output does not yet exist (Task 5 fix)
if not output_file.exists():
prompt_file = self.example_dir / "output" / "mappings" / f"{chapter_id}-prompt.md"
prompt_file.parent.mkdir(parents=True, exist_ok=True)
prompt_file.write_text(prompt)
print(f" Prompt written to {prompt_file.relative_to(self.example_dir)}")
if output_file.exists():
content = output_file.read_text()
self.store_output_artifact(
@@ -622,11 +627,14 @@ class ChapterProcessor:
if not prompt:
return None
prompt_file = self.example_dir / "output" / "analyses" / f"{chapter_id}-prompt.md"
prompt_file.write_text(prompt)
print(f" Prompt written to {prompt_file.relative_to(self.example_dir)}")
output_file = self.example_dir / "output" / "analyses" / f"{chapter_id}-analysis.md"
# Write compiled prompt only when output does not yet exist (Task 5 fix)
if not output_file.exists():
prompt_file = self.example_dir / "output" / "analyses" / f"{chapter_id}-prompt.md"
prompt_file.parent.mkdir(parents=True, exist_ok=True)
prompt_file.write_text(prompt)
print(f" Prompt written to {prompt_file.relative_to(self.example_dir)}")
if output_file.exists():
content = output_file.read_text()
self.store_output_artifact(
@@ -679,11 +687,14 @@ class ChapterProcessor:
if not prompt:
return None
prompt_file = self.example_dir / "output" / "metrics" / "metrics-prompt.md"
prompt_file.write_text(prompt)
print(f" Prompt written to {prompt_file.relative_to(self.example_dir)}")
output_file = self.example_dir / "output" / "metrics" / "metrics-report.md"
# Write compiled prompt only when output does not yet exist (Task 5 fix)
if not output_file.exists():
prompt_file = self.example_dir / "output" / "metrics" / "metrics-prompt.md"
prompt_file.parent.mkdir(parents=True, exist_ok=True)
prompt_file.write_text(prompt)
print(f" Prompt written to {prompt_file.relative_to(self.example_dir)}")
if output_file.exists():
content = output_file.read_text()
self.store_output_artifact(
@@ -709,6 +720,123 @@ class ChapterProcessor:
print(f" Awaiting output at: {output_file.relative_to(self.example_dir)}")
return None
    # ── Entity Evaluation (Task 9) ────────────────────────────────────

    def _extract_quality_rubric(self) -> str:
        """Extract the Quality Metrics section from the entity schema file."""
        schema_file = self.example_dir / "schemas" / "economic-entity-schema-v1.0.md"
        text = schema_file.read_text()
        # Find the ## Quality Metrics section up to the next ## section
        import re as _re
        m = _re.search(
            r"^## Quality Metrics\n(.*?)^## ",
            text,
            flags=_re.MULTILINE | _re.DOTALL,
        )
        if m:
            return ("## Quality Metrics\n" + m.group(1)).strip()
        return text  # fallback: whole schema

    def _extract_source_chapter_from_entity(self, entity_text: str) -> str:
        """Extract the Source Chapter field from an entity markdown file."""
        import re as _re
        m = _re.search(
            r"^## Source Chapter\s*\n+(.+?)(?:\n\n|\n##|\Z)",
            entity_text,
            flags=_re.MULTILINE | _re.DOTALL,
        )
        if m:
            return m.group(1).strip()
        return "Unknown chapter"

    def evaluate_entities(self, chapter_id: Optional[str] = None) -> None:
        """Evaluate canonical entities using the evaluate-entity template.

        If *chapter_id* is given, evaluates only entities introduced by that
        chapter (determined from the chapter view file). Otherwise evaluates
        all canonical entities.

        Outputs are written to ``output/evaluations/<slug>-eval.md``.
        Existing evaluation files are skipped (idempotent).
        """
        evaluations_dir = self.example_dir / "output" / "evaluations"
        evaluations_dir.mkdir(parents=True, exist_ok=True)

        # Determine which entity files to evaluate
        if chapter_id:
            view_file = self._entities_dir() / f"{chapter_id}-entities.md"
            if not view_file.exists():
                print(f" No chapter view found for {chapter_id}")
                return
            _, entity_files = self._read_entities_from_view(chapter_id)
            if not entity_files:
                print(f" No entities found for chapter {chapter_id}")
                return
            print(f"Evaluating {len(entity_files)} entities from {chapter_id}...")
        else:
            slugs = self._list_existing_entity_names()
            entity_files = [(s, self._entities_dir() / f"{s}.md") for s in slugs]
            print(f"Evaluating {len(entity_files)} canonical entities...")

        if not entity_files:
            print(" No entities to evaluate.")
            return

        # Shared context loaded once
        quality_rubric = self._extract_quality_rubric()
        self.bind_macro_artifact(self.spaces["guidelines"], "quality_rubric", quality_rubric)

        done = 0
        skipped = 0
        failed = 0

        for slug, entity_path in entity_files:
            output_file = evaluations_dir / f"{slug}-eval.md"
            if output_file.exists():
                skipped += 1
                continue
            if not entity_path.exists():
                print(f" MISSING: {entity_path.name}")
                failed += 1
                continue

            entity_text = entity_path.read_text()
            source_chapter = self._extract_source_chapter_from_entity(entity_text)

            # Bind per-entity macros
            self.bind_macro_artifact(self.spaces["entities"], "entity_content", entity_text)
            self.bind_macro_artifact(self.spaces["sources"], "source_chapter", source_chapter)

            prompt = self.resolve_and_compile(
                "evaluate-entity",
                ["entities", "sources", "vsm-reference", "guidelines"],
            )
            if not prompt:
                print(f" FAILED to compile prompt for {slug}")
                failed += 1
                continue

            # Write prompt only when output does not yet exist (Task 5 fix)
            prompt_file = evaluations_dir / f"{slug}-eval-prompt.md"
            if not output_file.exists():
                prompt_file.write_text(prompt)

            if not self.llm_adapter:
                print(f" {slug}: prompt written, awaiting manual evaluation")
                done += 1
                continue

            print(f" Evaluating: {slug}...")
            content = self._execute_llm(prompt, output_file, f"eval:{slug}", max_tokens=1024)
            if content:
                done += 1
            else:
                failed += 1

        total = done + skipped + failed
        print(f"\nEvaluation complete: {done} done, {skipped} skipped (existing), {failed} failed — {total} total")

# ── Chapter Processing ───────────────────────────────────────────
def process_chapter(self, chapter_id: str, auto_commit: bool = True):
@@ -994,9 +1122,13 @@ def main():
help="Run collection-level quality checks (C1-C5)")
group.add_argument("--infospace-viability", action="store_true",
help="Show viability dashboard")
group.add_argument("--evaluate", action="store_true",
help="Evaluate entity quality using the evaluate-entity template")
parser.add_argument("--reason", type=str, default=None,
help="Reason for archiving (used with --archive-entity)")
parser.add_argument("--eval-chapter", type=str, default=None, metavar="CHAPTER_ID",
help="Limit --evaluate to entities from a specific chapter")
parser.add_argument("--no-commit", action="store_true", help="Skip git commits")
parser.add_argument(
"--provider",
@@ -1064,6 +1196,9 @@ def main():
elif args.infospace_viability:
_run_infospace_viability(example_dir)
return
elif args.evaluate:
processor.evaluate_entities(chapter_id=args.eval_chapter)
return
processor.show_stats()
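
The `_extract_quality_rubric` helper in this diff only matches when another `## ` heading follows the Quality Metrics section; if that section were ever the last one in the schema, the regex would miss and the whole schema would be returned. One possible variant, offered as a sketch rather than as the project's code, accepts end-of-text as a terminator. The function name `extract_section` is hypothetical.

```python
import re
from typing import Optional


def extract_section(markdown: str, heading: str) -> Optional[str]:
    """Return the H2 section titled `heading`, or None when it is absent.

    The section ends at the next line starting with "## " or at the end of
    the text, so a trailing section is still extracted.
    """
    pattern = rf"^## {re.escape(heading)}[ \t]*\n(.*?)(?=^## |\Z)"
    match = re.search(pattern, markdown, flags=re.MULTILINE | re.DOTALL)
    if match is None:
        return None
    return f"## {heading}\n{match.group(1)}".strip()


if __name__ == "__main__":
    sample = (
        "# Schema\n\n"
        "## Overview\nIntro text.\n\n"
        "## Quality Metrics\nScored 1-5.\n\n"
        "### Definition Precision (1-5)\nRubric text.\n"
    )
    print(extract_section(sample, "Quality Metrics"))
```

Returning `None` instead of the full schema makes a missing section visible to the caller; whether the silent fallback in the commit is preferable depends on how the template uses `quality_rubric`.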


@@ -39,6 +39,45 @@ this entity. Must be enclosed in quotation marks with chapter reference.
How this entity is understood in modern economic theory, including
any evolution in meaning since Smith's time.
## Quality Metrics
Used by the `evaluate-entity` prompt template to score each entity on five
dimensions. Each dimension is scored 1–5, where 1 = very poor and 5 = excellent.
### Definition Precision (1-5)
Is the definition specific, non-circular, and clearly distinguishable from
neighbouring concepts? A score of 5 means the definition uniquely identifies
the concept without relying on terms that are themselves undefined within the
infospace. A score of 1 means the definition is vague, tautological, or
indistinguishable from another entity.
### Source Grounding (1-5)
Is the entity grounded in a specific, verifiable passage from the source text?
A score of 5 means a citation is present, the cited chapter exists, and the
definition accurately reflects the cited passage. A score of 1 means no
citation is given or the definition contradicts the source.
### Domain Placement (1-5)
Is the economic domain assignment correct and specific? A score of 5 means
the assigned domain (e.g., Production, Distribution) is the most precise
fit and would not be improved by a different choice. A score of 1 means the
domain is wrong, or "General Theory" is used when a more specific domain
applies.
### VSM Relevance (1-5)
Does this entity connect meaningfully to at least one VSM system (S1–S5,
recursion, variety, algedonic signals)? A score of 5 means the entity is
directly mappable to a VSM concept with a clear structural rationale. A
score of 1 means the entity has no discernible VSM connection and may be
too granular or peripheral to the system model.
### Explanatory Value (1-5)
Does this entity contribute to explaining the economic system as a whole, or
is it a restatement of another concept? A score of 5 means removing this
entity would leave a meaningful gap in the infospace. A score of 1 means
another entity already covers this ground, or the entity adds no
explanatory power.
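
As an illustration of how the five rubric dimensions above might be checked once scores have been parsed from an evaluation, here is a small sketch. It assumes the scores are already available as a name-to-integer mapping; the `EntityScores` class and that mapping shape are hypothetical and are not defined by this schema or by the `evaluate-entity` template.

```python
from dataclasses import dataclass
from typing import Dict

# Dimension names as listed in this Quality Metrics section.
DIMENSIONS = (
    "Definition Precision",
    "Source Grounding",
    "Domain Placement",
    "VSM Relevance",
    "Explanatory Value",
)


@dataclass(frozen=True)
class EntityScores:
    """Hypothetical container enforcing the 1-5 range on every dimension."""

    scores: Dict[str, int]

    def __post_init__(self) -> None:
        missing = [d for d in DIMENSIONS if d not in self.scores]
        if missing:
            raise ValueError(f"missing dimensions: {missing}")
        for name, value in self.scores.items():
            if name not in DIMENSIONS:
                raise ValueError(f"unknown dimension: {name}")
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int):
                raise ValueError(f"{name}: score must be an integer")
            if not 1 <= value <= 5:
                raise ValueError(f"{name}: score {value} is outside 1-5")


if __name__ == "__main__":
    ok = EntityScores({name: 4 for name in DIMENSIONS})
    print(sorted(ok.scores))
```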
## Validation Rules
1. The document MUST contain an H1 heading with the entity name.


@@ -33,6 +33,25 @@ might not fit the VSM concept perfectly.
Other VSM concepts this entity could plausibly map to,
with brief rationale for each alternative.
## Quality Metrics
Used by the `evaluate-entity` prompt template when assessing mapping quality.
Each dimension is scored 1–5, where 1 = very poor and 5 = excellent.
### Rationale Rigour (1-5)
Is the mapping justified with reference to Beer's VSM definitions, not just
surface-level analogy? A score of 5 means the rationale cites specific VSM
properties (e.g., "S2 attenuates variety between S1 units") and shows how
the economic entity fulfils that role. A score of 1 means the rationale is
a loose metaphor with no structural grounding.
### Strength Calibration (1-5)
Is the declared Mapping Strength (Strong, Moderate, Weak) consistent with
the rationale given? A score of 5 means the declared strength matches the
depth of correspondence described. A score of 1 means the strength is
overclaimed (e.g., "Strong" for a tangential analogy) or underclaimed
(e.g., "Weak" for a direct structural match).
## Validation Rules
1. The document MUST contain an H1 heading in the format "Entity Name -> VSM Concept Name".