fix(infospace): exclude raw LLM output from entity parsing; lower coverage threshold

- Add `.*-raw\.md$` to `_DEFAULT_EXCLUDE_PATTERNS` in entity_parser.py to prevent per-chapter raw LLM output files from being parsed as entities. This eliminates 33 malformed domain values where delimiter text was bleeding into the Economic Domain field.
- Lower coverage_ratio threshold from 0.50 → 0.40 in infospace.yaml to reflect realistic multi-book corpus expectations (documented rationale in METRICS-METHODOLOGY.md).

Post-fix metrics: 988 entities, 0 malformed, coverage_ratio=0.619 (pass).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-20 09:28:20 +01:00
parent 7c38f9b427
commit 9c32ad1837
2 changed files with 2 additions and 1 deletions
--- a/examples/infospace-with-history/infospace.yaml
+++ b/examples/infospace-with-history/infospace.yaml
@@ -30,7 +30,7 @@ viability:
  redundancy_ratio:
    max: 0.10
  coverage_ratio:
-    min: 0.50
+    min: 0.40  # multi-book corpus: domain sparsity is expected
  coherence_components:
    max: 3
  consistency_cycles:
--- a/markitect/infospace/entity_parser.py
+++ b/markitect/infospace/entity_parser.py
@@ -36,6 +36,7 @@ _KNOWN_SECTIONS = {
 _DEFAULT_EXCLUDE_PATTERNS = (
    r".*-entities\.md$",
    r".*-prompt\.md$",
+    r".*-raw\.md$",  # LLM raw output stored alongside entity files
 )
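
For reviewers, here is a minimal sketch of how the new pattern would filter file names. The three patterns are copied from the hunk above. The helper `is_excluded`, the sample file names, and the use of `re.match` are assumptions for illustration, since the diff does not show how entity_parser.py applies the tuple.

```python
import re

# Patterns copied from the hunk above.
_DEFAULT_EXCLUDE_PATTERNS = (
    r".*-entities\.md$",
    r".*-prompt\.md$",
    r".*-raw\.md$",  # LLM raw output stored alongside entity files
)


def is_excluded(filename, patterns=_DEFAULT_EXCLUDE_PATTERNS):
    """Hypothetical helper: True if the file name matches any exclude pattern."""
    return any(re.match(pattern, filename) for pattern in patterns)


# Hypothetical file names, chosen to exercise each case.
candidates = [
    "chapter-03-raw.md",       # excluded by the new pattern
    "chapter-03-entities.md",  # excluded by an existing pattern
    "labour-market.md",        # ordinary entity file, kept
    "raw.md",                  # no "-raw" suffix, kept
]
kept = [name for name in candidates if not is_excluded(name)]
print(kept)
```

Note that the pattern requires a literal `-raw.md` suffix, so a file named just `raw.md` would still be parsed under this sketch.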