fix(infospace): exclude raw LLM output from entity parsing; lower coverage threshold

- Add `.*-raw\.md$` to `_DEFAULT_EXCLUDE_PATTERNS` in entity_parser.py to
  prevent per-chapter raw LLM output files from being parsed as entities.
  This eliminates 33 malformed domain values where delimiter text was
  bleeding into the Economic Domain field.
- Lower the coverage_ratio minimum from 0.50 to 0.40 in infospace.yaml to
  reflect realistic multi-book corpus expectations (rationale documented
  in METRICS-METHODOLOGY.md).

Post-fix metrics: 988 entities, 0 malformed, coverage_ratio=0.619 (pass).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-20 09:28:20 +01:00
parent 7c38f9b427
commit 9c32ad1837
2 changed files with 2 additions and 1 deletion

infospace.yaml

@@ -30,7 +30,7 @@ viability:
   redundancy_ratio:
     max: 0.10
   coverage_ratio:
-    min: 0.50
+    min: 0.40 # multi-book corpus: domain sparsity is expected
   coherence_components:
     max: 3
   consistency_cycles:
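
For illustration only, a minimal Python sketch of how min/max bounds like the ones in this hunk might be evaluated. The `THRESHOLDS` dict, the `evaluate` helper, and the inclusive comparisons are assumptions made for this example and are not taken from the repository; only the numeric bounds and the 0.619 coverage value come from the commit.

```python
# Hypothetical mirror of the thresholds visible in the hunk above.
THRESHOLDS = {
    "redundancy_ratio": {"max": 0.10},
    "coverage_ratio": {"min": 0.40},
    "coherence_components": {"max": 3},
}


def evaluate(metrics, thresholds=THRESHOLDS):
    """Return {metric_name: passed} for every measured metric with a threshold."""
    results = {}
    for name, bounds in thresholds.items():
        if name not in metrics:
            continue  # no measurement, nothing to judge
        value = metrics[name]
        ok = True
        if "min" in bounds:
            ok = ok and value >= bounds["min"]  # inclusive bound (assumption)
        if "max" in bounds:
            ok = ok and value <= bounds["max"]  # inclusive bound (assumption)
        results[name] = ok
    return results


# Post-fix coverage value quoted in the commit message.
print(evaluate({"coverage_ratio": 0.619}))
```

Under these assumptions the reported coverage_ratio of 0.619 clears the 0.40 minimum, which matches the "(pass)" in the commit message.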

entity_parser.py

@@ -36,6 +36,7 @@ _KNOWN_SECTIONS = {
 _DEFAULT_EXCLUDE_PATTERNS = (
     r".*-entities\.md$",
     r".*-prompt\.md$",
+    r".*-raw\.md$",  # LLM raw output stored alongside entity files
 )
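
As a rough sketch of the effect of the new pattern, the snippet below applies the three patterns from this hunk to a list of file names. The `is_excluded` and `filter_entity_files` helpers and the sample file names are hypothetical, and the use of `re.match` is an assumption; the diff does not show how entity_parser.py actually applies the patterns.

```python
import re

# Patterns copied from the diff; everything else here is illustrative.
_DEFAULT_EXCLUDE_PATTERNS = (
    r".*-entities\.md$",
    r".*-prompt\.md$",
    r".*-raw\.md$",  # LLM raw output stored alongside entity files
)

_COMPILED = tuple(re.compile(p) for p in _DEFAULT_EXCLUDE_PATTERNS)


def is_excluded(filename: str) -> bool:
    """True if the file name matches any exclude pattern (hypothetical helper)."""
    return any(rx.match(filename) for rx in _COMPILED)


def filter_entity_files(filenames):
    """Keep only the file names that should be parsed as entities."""
    return [f for f in filenames if not is_excluded(f)]


# Made-up file names, for demonstration only.
candidates = ["ch01-raw.md", "ch01-prompt.md", "labor-market.md", "raw-materials.md"]
print(filter_entity_files(candidates))
```

Because the pattern is anchored on the `-raw.md` suffix, a file such as `raw-materials.md` is still parsed, while per-chapter raw output files are skipped.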