
tegwick 4ce856d4d0 docs: metrics methodology, collection-level tasks, and infospace tooling roadmap

Add METRICS-METHODOLOGY.md documenting the theoretical frameworks
(SEQUAL, OntoClean, OOPS!, OntoQA, FCA, DSL principles) adapted for
two-layer evaluation (LLM-Eval + deterministic aggregation) across
five collection concerns: redundancy, coverage, coherence, consistency,
and granularity balance.

Extend INFRA-TASKS.md with assignment assessment (tasks 4-7),
per-concept metrics (tasks 8-12), and collection-level metrics
(tasks 13-19).

Add roadmap/infospace-tooling/PLAN.md defining terminology (infospace,
topic, discipline, entity, evaluation, viability) and a three-stage
implementation plan: Stage 1 platform additions, Stage 2 infospace
tooling layer, Stage 3 example revision.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-02-18 23:53:21 +01:00


Markitect Infrastructure Tasks

Issues discovered while building the infospace-with-history example. Tasks 1-3 have been fixed in commit 706981c, and the pipeline script has been refactored to use the fixed infrastructure directly. Tasks 4-19 remain open.

1. Artifact Repository does not store content — RESOLVED

File: markitect/prompts/resolver/resolver.py, lines 147-148 Issue: The resolver returns placeholder text (content = f"[Content of {artifact.name} from {space_id}]") instead of actual artifact content, because the SQLiteArtifactRepository stores metadata (digest, name, type) but not the content itself. Impact: Consumers must maintain their own content cache alongside the repository, defeating the purpose of centralised artifact storage. Fix applied: Added content field to Artifact model, content TEXT column to SQLite schema (with migration for existing DBs), and replaced the resolver placeholder with artifact.content.

2. ContentMacro raw_text defaults to empty string — RESOLVED

File: markitect/prompts/templates/models.py, line 46 Issue: raw_text: str = "" — when macros are constructed programmatically (not parsed from template text), raw_text defaults to "". The ContextCompiler then calls str.replace("", resolved.content) which inserts content between every character, producing multi-gigabyte output. Impact: Silent data corruption; compiled prompts become unusable. Fix applied: Added __post_init__ to ContentMacro that auto-derives raw_text = f"@{{{self.target}}}" when not provided.
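The failure mode and the fix can be reproduced in a few lines. This is a minimal sketch: the `ContentMacro` below is a two-field stand-in (an assumption for illustration), not the actual model in models.py.

```python
from dataclasses import dataclass


@dataclass
class ContentMacro:
    """Stand-in for the real model; only the fields needed here (assumed)."""
    target: str
    raw_text: str = ""

    def __post_init__(self) -> None:
        # Derive the shorthand form when the macro was built programmatically.
        if not self.raw_text:
            self.raw_text = f"@{{{self.target}}}"


# Why an empty raw_text is dangerous: replacing "" matches at every position.
corrupted = "abc".replace("", "X")  # "XaXbXcX"

# With the derived raw_text, only the macro itself is replaced.
macro = ContentMacro(target="existing_entities")
template = "Known entities: @{existing_entities}"
compiled = template.replace(macro.raw_text, "labour, wages")
```

Because Python's `str.replace` treats an empty search string as matching before every character and at the end, output size grows with the product of template length and content length, which is consistent with the multi-gigabyte output described above.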

3. No TemplateAnalyzer support for @{target} syntax — RESOLVED

File: markitect/prompts/templates/parser.py Issue: The MacroParser parses {{kind:target}} syntax but the templates in this example use the simplified @{target} syntax. There's no automatic parsing for this format, requiring manual macro construction. Fix applied: Added SHORTHAND_PATTERN to MacroParser that recognises @{target} and maps it to MacroKind.REQUIRED. Updated has_macros(), count_macros(), and find_macro_positions() accordingly.
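As an illustration of the shorthand recognition, the sketch below uses a hypothetical regular expression and helper names; the actual SHORTHAND_PATTERN in parser.py may differ (for example in which characters it allows inside the braces).

```python
import re

# Hypothetical pattern: "@{" + a target without braces or whitespace + "}".
SHORTHAND_PATTERN = re.compile(r"@\{([^{}\s]+)\}")


def find_shorthand_targets(text: str) -> list[str]:
    """Return the targets of all @{target} macros in order of appearance."""
    return [m.group(1) for m in SHORTHAND_PATTERN.finditer(text)]


def has_shorthand_macros(text: str) -> bool:
    """True when the text contains at least one @{target} macro."""
    return SHORTHAND_PATTERN.search(text) is not None


targets = find_shorthand_targets("Use @{entity_content} and @{vsm_framework}.")
```

The leading `@` keeps the shorthand from colliding with the existing {{kind:target}} syntax, so both forms can appear in the same template.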

Assignment Assessment (18 Feb 2026)

How the example measures against the objectives stated in README.md:

| # | Objective | Status | Notes |
|---|-----------|--------|-------|
| 1 | Capture knowledge from Wealth of Nations | Partial | 7 of 35 chapters processed (Book I, ch. 1-7). 85 canonical entities extracted. |
| 2 | Transform to VSM concepts/entities | Done (for processed chapters) | Entities mapped to S1-S5 with strength ratings. |
| 3 | Consistent and complete | Not yet | Only 20% of chapters done. Metrics report exists but covers limited scope. |
| 4 | Schemas as scaffolding | Done | Four schemas defined and used across all stages. |
| 5 | Prompt dependency resolution | Done | `@{macro}` templates resolved via MultiSpaceResolutionStrategy. |
| 6 | Incremental chapter injection | Done | Pipeline processes one chapter at a time; `@{existing_entities}` prevents duplication. |
| 7 | Keep changes as git history | Not done | See task 4 below. |
| 8 | Metrics for completeness/consistency | Partial | Template and report exist but only cover 4 chapters (report predates ch. 5-7). |
| 9 | No infrastructure changes during experiment | Violated | Three infra fixes were required (tasks 1-3 above). Documented as intended. |
| 10 | Generate task list for infra issues | Done | This file. |

4. Infospace has no per-chapter git history — OPEN

Objective: README states "The information space should utilize the option of keeping changes as git history." Issue: The 7 processed chapters were committed in mixed batches alongside infrastructure changes (LLM adapters, entity refactoring, archive policy). Chapters 1-2 are bundled into fecc2fd with the entire LLM module. Chapters 5-7 share a single commit (41773f1) with the OpenAI adapter and archive policy. There is no commit where you can git diff to see exactly what one chapter contributed to the infospace. Impact: Cannot use git log, git diff, or git bisect to trace how the infospace grew chapter by chapter — the core promise of "with history." Suggested fix: Re-run the 7 processed chapters (and remaining 28) using process_chapters.py without --no-commit, on a clean branch or after squashing the current output into a baseline commit. Each chapter gets its own commit via _git_commit_chapter().

5. Prompt files are regenerated as a side-effect of DB rebuild — OPEN

Issue: Running --all --no-commit to regenerate infospace.db also overwrites *-prompt.md files in the output directories because each pipeline stage unconditionally writes the compiled prompt before checking whether output already exists. The @{existing_entities} macro content shifts as earlier chapters are loaded, so prompt files for already-processed chapters change on every full run. Impact: A DB regeneration dirties the working tree with prompt file changes, even though no actual outputs changed. Users must git checkout the prompt files after regeneration. Suggested fix: Skip writing prompt files when the corresponding output file already exists on disk, or add a --rebuild-db-only flag that populates the database without touching the file system.
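The first suggested fix (skip the prompt write when output already exists) could look roughly like this. The function name and signature are hypothetical; the real pipeline stages in process_chapters.py may organise their file writes differently.

```python
from pathlib import Path


def write_prompt_if_needed(prompt_path: Path, output_path: Path, compiled: str) -> bool:
    """Write the compiled prompt unless the stage output already exists.

    Returns True when the prompt file was written.
    """
    if output_path.exists():
        # Stage already produced output; leave the committed prompt untouched.
        return False
    prompt_path.parent.mkdir(parents=True, exist_ok=True)
    prompt_path.write_text(compiled, encoding="utf-8")
    return True
```

With this guard, a full DB rebuild would leave prompt files for already-processed chapters byte-identical, so the working tree stays clean.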

6. Metrics report is stale — OPEN

Issue: The metrics report (output/metrics/metrics-report.md) was generated after chapters 1-4. Chapters 5-7 have since been processed but the report has not been refreshed. Impact: The metrics do not reflect the current state of the infospace. Suggested fix: Re-run --metrics --provider <provider> --no-commit after every batch of new chapters. Consider making metrics assessment automatic at the end of --book or --all runs.

7. Remaining 28 chapters not yet processed — OPEN

Issue: Only Book I chapters 1-7 have been processed. Books II-V (28 chapters) remain unprocessed. Impact: The infospace is incomplete — VSM coverage is limited to S1, S2, and partial S4. S3, S3*, S5, and many systemic concepts (algedonic signals, recursion, variety) are expected to emerge from later books. Suggested fix: Process remaining chapters in book-sized batches with per-chapter commits, refreshing metrics after each book.

Per-Concept Metrics (tasks 8-12)

The current metrics system is a single LLM-evaluated narrative report that assesses the infospace as a whole. It produces no machine-readable output, cannot be tracked over time, and conflates per-concept quality with collection-level coherence.

The improvement splits metrics into two layers:

LLM-Eval: A prompt template evaluates each concept individually against quality criteria defined in the schema. The LLM returns structured scores, not prose.
Deterministic aggregation: process_chapters.py computes what it can from files on disk (schema compliance, word counts, section presence, coverage tallies) and aggregates LLM-eval scores into dashboard metrics.

Both layers persist results in structured form so they can be diffed, tracked over time, and committed alongside the entities they evaluate.

8. Add per-concept quality metrics to entity schema — OPEN

Issue: The entity schema (economic-entity-schema-v1.0.md) defines required sections and validation rules (section presence, word count range) but no quality criteria. There is no definition of what makes a good entity versus a merely compliant one. Suggested fix: Add a ## Quality Metrics section to the entity schema defining evaluation dimensions with scoring rubrics:

Definition Precision (1-5): Is the definition specific, non-circular, and distinguishable from neighbouring concepts?
Source Grounding (1-5): Is the entity grounded in a specific passage? Does the citation exist and support the definition?
Domain Placement (1-5): Is the economic domain assignment correct and specific (not just "General Theory")?
VSM Relevance (1-5): Does the entity connect meaningfully to at least one VSM system, or is it too granular/abstract to map?
Explanatory Value (1-5): Does this entity contribute to explaining the economic system, or is it a restatement of another concept?

Similarly update the VSM mapping schema with:

Rationale Rigour (1-5): Is the mapping justified with reference to Beer's definitions, not just surface-level analogy?
Strength Calibration (1-5): Is the declared strength (Strong/Moderate/Weak) consistent with the rationale given?

These rubrics become the prompt instructions for task 9.

9. Create evaluate-entity prompt template — OPEN

Depends on: Task 8 (quality metrics in schema). Issue: There is no mechanism to evaluate an existing entity after extraction. Quality is only judged implicitly during the global metrics assessment, which is too coarse to identify individual weak entities. Suggested fix: Create templates/evaluate-entity.md — a prompt template that:

Takes @{entity_content}, @{source_chapter}, @{vsm_framework}, and @{quality_rubric} (from the schema's quality metrics section).
Asks the LLM to score each dimension (1-5) with a one-sentence justification per score.
Outputs structured YAML front-matter (scores) followed by markdown (justifications), e.g.:

---
entity: division-of-labour
scores:
  definition_precision: 5
  source_grounding: 5
  domain_placement: 4
  vsm_relevance: 5
  explanatory_value: 5
overall: 4.8
flags: []
---

Add a pipeline stage: --evaluate runs this template against every canonical entity and writes results to output/evaluations/<slug>-eval.md. A --evaluate --chapter <id> variant evaluates only entities introduced by that chapter.
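Once the LLM has returned per-dimension scores, the summary fields can be derived deterministically rather than trusted from the model. The sketch below assumes `overall` is the unweighted mean of the dimension scores (which matches the example above) and uses a hypothetical `flag_below` threshold that the schema does not define.

```python
from statistics import mean


def summarise_scores(scores: dict[str, int], flag_below: int = 3) -> dict:
    """Compute the overall score and flag weak dimensions.

    flag_below is a hypothetical threshold, not defined in the schema.
    """
    for name, value in scores.items():
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be in 1..5, got {value}")
    return {
        "overall": round(mean(scores.values()), 1),
        "flags": sorted(n for n, v in scores.items() if v < flag_below),
    }


summary = summarise_scores({
    "definition_precision": 5,
    "source_grounding": 5,
    "domain_placement": 4,
    "vsm_relevance": 5,
    "explanatory_value": 5,
})
# summary == {"overall": 4.8, "flags": []}
```

Recomputing `overall` in the pipeline also catches evaluations where the model's own arithmetic does not match its scores.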

10. Add deterministic schema compliance checker — OPEN

Issue: Schema compliance is currently LLM-evaluated ("100%" in the metrics report) but the validation rules in the schemas are mechanical: section presence, word count ranges, heading format. These should be checked programmatically, not by an LLM. Suggested fix: Add a validate_entity(path) -> ValidationResult function to process_chapters.py (or a new validate.py module) that:

Parses the markdown to extract H2 section headings
Checks required sections are present (Definition, Source Chapter, Context, Economic Domain)
Counts words in the Definition section (must be 20-150)
Checks H1 heading exists and is not a slug (e.g. effectual-demand in chapter 7 has # effectual-demand instead of # Effectual Demand)
Validates Source Chapter cites a specific book/chapter
For mapping files: checks Mapping Strength is one of the enum values

Expose as --validate CLI flag. Output a structured report:

Validation: 85 entities, 3 warnings
  effectual-demand.md: H1 is slug format, not title case
  porter.md: Definition is 18 words (minimum 20)
  ...

This is fully deterministic — no LLM calls needed.
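A minimal sketch of the checks listed above, operating on the markdown text rather than a path to keep it self-contained. The function names, the slug pattern, and the warning wording are assumptions; the Source Chapter and mapping-strength checks are omitted.

```python
import re
from typing import Optional

REQUIRED_SECTIONS = ("Definition", "Source Chapter", "Context", "Economic Domain")
# Assumed slug shape: lowercase words joined by hyphens.
SLUG_RE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


def split_sections(text: str) -> tuple[Optional[str], dict[str, str]]:
    """Return the H1 title and a mapping of H2 heading -> body text."""
    title, sections, current = None, {}, None
    for line in text.splitlines():
        if line.startswith("# ") and title is None:
            title = line[2:].strip()
        elif line.startswith("## "):
            current = line[3:].strip()
            sections[current] = ""
        elif current is not None:
            sections[current] += line + "\n"
    return title, sections


def validate_entity_text(text: str) -> list[str]:
    """Return human-readable warnings; an empty list means compliant."""
    title, sections = split_sections(text)
    warnings = []
    if title is None:
        warnings.append("H1 heading is missing")
    elif SLUG_RE.match(title):
        warnings.append("H1 is slug format, not title case")
    for name in REQUIRED_SECTIONS:
        if name not in sections:
            warnings.append(f"Missing section: {name}")
    if "Definition" in sections:
        words = len(sections["Definition"].split())
        if not 20 <= words <= 150:
            warnings.append(f"Definition is {words} words (allowed 20-150)")
    return warnings
```

A `validate_entity(path)` wrapper would then read the file and attach the filename to each warning to produce the report format shown above.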

11. Structured metrics output format — OPEN

Depends on: Tasks 9 and 10. Issue: The metrics report is a markdown narrative. Values cannot be parsed programmatically, diffed meaningfully, or plotted over time. Suggested fix: Alongside the human-readable metrics-report.md, emit a machine-readable metrics.yaml (or .json) containing:

timestamp: "2026-02-18T12:00:00Z"
chapters_processed: 7
chapters_total: 35
entities_total: 85
entities_archived: 0
vsm_coverage:
  S1: 28
  S2: 12
  S3: 8
  S3_star: 0
  S4: 5
  S5: 0
  recursion: 1
  variety: 0
mapping_strength:
  strong: 64
  moderate: 18
  weak: 3
validation:
  schema_compliant: 82
  warnings: 3
evaluation:    # from LLM-eval (task 9)
  mean_overall: 4.2
  min_overall: 2.8
  flagged_entities: ["porter", "country-workman"]

The --metrics command writes both files. The YAML file is committed to git so git diff shows exactly how metrics changed between runs.

12. Metrics-over-time tracking — OPEN

Depends on: Task 11 (structured output). Issue: There is one metrics snapshot that gets overwritten. No history of how metrics evolved as chapters were added. Suggested fix: Append each metrics snapshot to a cumulative log file output/metrics/metrics-history.yaml (list of timestamped entries). This is committed to git alongside the current snapshot. The pipeline can optionally render a simple text-based progress summary:

Metrics history (5 snapshots):
  2026-02-10  ch 1/35   13 entities  41.7% VSM coverage
  2026-02-11  ch 4/35   38 entities  50.0% VSM coverage
  2026-02-11  ch 7/35   85 entities  58.3% VSM coverage
  ...

This provides the "metrics that improve over time" feedback loop the README envisions: process chapters → evaluate → see coverage grow (or flag regressions when a re-extraction reduces quality scores).
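The text summary can be rendered from the history entries with a short helper. This sketch assumes each snapshot is a dict with the task 11 field names plus a hypothetical `vsm_coverage_pct` field; how that percentage is derived is not specified here.

```python
def render_history(snapshots: list[dict]) -> str:
    """Render one line per snapshot, oldest first."""
    lines = [f"Metrics history ({len(snapshots)} snapshots):"]
    for snap in sorted(snapshots, key=lambda s: s["timestamp"]):
        date = snap["timestamp"][:10]  # ISO 8601 date prefix
        lines.append(
            f"  {date}  ch {snap['chapters_processed']}/{snap['chapters_total']}"
            f"   {snap['entities_total']} entities"
            f"  {snap['vsm_coverage_pct']:.1f}% VSM coverage"
        )
    return "\n".join(lines)
```

Sorting on the ISO 8601 timestamp string is enough for chronological order as long as all entries use the same UTC format.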

Collection-Level Metrics (tasks 13-19)

These tasks implement the five collection-level concerns described in METRICS-METHODOLOGY.md. They share underlying infrastructure (entity metadata index, definition embeddings, relationship graph) that should be built once per evaluation run.

See the methodology document for theoretical grounding, framework references, and the full metric definitions per concern.

13. Entity metadata index — deterministic parsing layer — OPEN

Depends on: Task 10 (schema compliance checker shares parsing logic). Issue: Several collection-level metrics (coverage matrix, FCA context, granularity distribution) require structured metadata extracted from entity files: H1 title, economic domain, VSM system(s), source chapter, section presence, word counts. Currently this information exists only as prose inside markdown files. Suggested fix: Add a parse_entity_metadata(path) -> EntityMeta function that extracts from each entity file:

from dataclasses import dataclass

@dataclass
class EntityMeta:
    slug: str
    title: str                       # from H1
    domain: str                      # from Economic Domain section
    source_chapter: str              # from Source Chapter section
    definition_words: int            # word count of Definition section
    has_original_wording: bool       # optional section present?
    has_modern_interpretation: bool  # optional section present?
    vsm_systems: list[str]           # from mapping file if exists
    mapping_strengths: list[str]     # from mapping file if exists

Build an index of all entities at the start of each evaluation run. This index is the input for tasks 14, 15, 16, and 18. Expose as --index CLI flag for inspection.
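The file-only part of the extraction might be sketched as follows. `EntityCore` and `parse_entity_core` are hypothetical names covering just the fields readable from the entity file itself; the VSM fields are left out because the mapping file format is not described here.

```python
import re
from dataclasses import dataclass


@dataclass
class EntityCore:
    """Subset of EntityMeta that can be read from the entity file alone."""
    slug: str
    title: str
    domain: str
    source_chapter: str
    definition_words: int


def parse_entity_core(slug: str, text: str) -> EntityCore:
    """Extract metadata from entity markdown (H1 title, H2 sections)."""
    title_match = re.search(r"^# (.+)$", text, flags=re.MULTILINE)
    # re.split with one capture group yields
    # [preamble, heading1, body1, heading2, body2, ...]
    parts = re.split(r"^## (.+)$", text, flags=re.MULTILINE)
    sections = {h.strip(): b.strip() for h, b in zip(parts[1::2], parts[2::2])}
    return EntityCore(
        slug=slug,
        title=title_match.group(1).strip() if title_match else "",
        domain=sections.get("Economic Domain", ""),
        source_chapter=sections.get("Source Chapter", ""),
        definition_words=len(sections.get("Definition", "").split()),
    )
```

Sharing the section-splitting step with the compliance checker from task 10 would avoid parsing each file twice.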

14. Redundancy detection (Concern C1) — OPEN

Depends on: Task 13 (metadata index). Methodology: OOPS! P2 (synonymous classes) + embedding similarity + LLM pairwise judgment. See METRICS-METHODOLOGY.md §4 C1. Issue: Entities with different slugs but overlapping meanings (e.g. natural-rate / ordinary-or-average-rate) survive extraction because dedup only checks slug collisions. There is no semantic overlap detection. Suggested fix: Implement in three stages:

Embed — Compute vector embeddings of all entity definitions using an embedding API (OpenRouter, OpenAI, or a local sentence-transformer). Cache embeddings in output/metrics/embeddings.json keyed by {slug: content_digest} so unchanged entities skip re-embedding.
Similarity matrix — Compute NxN cosine similarity. Write the full matrix to output/metrics/similarity-matrix.json. Flag all pairs with cosine > 0.80 as candidates.
LLM pairwise judgment — For each candidate pair, run a prompt: "Given these two entity definitions, are they (a) the same concept and should be merged, (b) genuinely distinct, or (c) partially overlapping and should be clarified?" Write results to output/metrics/redundancy-report.md + YAML.

Metrics produced:

high_similarity_pairs: count and list
confirmed_synonyms: count (LLM-confirmed same concept)
redundancy_ratio: confirmed_synonyms / total_entities
intensional_conciseness: 1 - redundancy_ratio

CLI: --check-redundancy --provider <provider>
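The similarity stage and the derived ratios are small enough to sketch in pure Python. The embeddings are assumed to be already computed (toy two-dimensional vectors are used in the usage example); function names are illustrative, and a real run would likely use NumPy for the NxN matrix.

```python
import math
from itertools import combinations


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def candidate_pairs(embeddings: dict[str, list[float]], threshold: float = 0.80):
    """Return (slug_a, slug_b, similarity) for every pair above the threshold."""
    pairs = []
    for a, b in combinations(sorted(embeddings), 2):
        sim = cosine(embeddings[a], embeddings[b])
        if sim > threshold:
            pairs.append((a, b, sim))
    return pairs


def conciseness(confirmed_synonyms: int, total_entities: int) -> float:
    """intensional_conciseness = 1 - redundancy_ratio."""
    return 1 - confirmed_synonyms / total_entities


toy = {
    "natural-rate": [1.0, 0.0],
    "ordinary-or-average-rate": [0.9, 0.1],
    "porter": [0.0, 1.0],
}
flagged = candidate_pairs(toy)  # only the two rate entities exceed 0.80
```

Only the flagged pairs go on to the LLM pairwise judgment, which keeps the number of LLM calls far below the N(N-1)/2 possible pairs.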

15. Coverage completeness (Concern C2) — OPEN

Depends on: Task 13 (metadata index). Methodology: SEQUAL completeness + FCA gap analysis + DSL competency questions. See METRICS-METHODOLOGY.md §4 C2. Issue: Coverage is currently assessed by the LLM in a single narrative pass. There is no structured view of which domain × VSM cells are populated, and no way to test whether the entity set can answer specific questions about the economic system. Suggested fix: Implement in three stages:

Domain × VSM matrix — From the metadata index, count entities per {economic_domain, vsm_system} cell. Render as a table. Identify empty cells as specific, actionable gaps. Compute:
- coverage_ratio = populated_cells / total_cells
- vsm_balance_entropy = -Σ(pᵢ log pᵢ) across VSM systems
FCA lattice — Construct a formal context with objects = entities, attributes = {domain, vsm_system, source_book, abstraction_level}. Compute the concept lattice (Python concepts library). Extract attribute combinations with no corresponding entity — these are structural coverage gaps not visible in the simple matrix.
Competency questions — Define a set of 15-20 canonical questions the infospace should answer (stored in schemas/competency-questions.md). Example questions:
- "How does the division of labour relate to market extent?"
- "What mechanisms regulate wages toward their natural rate?"
- "How do monopolies distort the viable system?"
LLM-Eval tests whether the current entities suffice to answer each question. Unanswerable questions identify specific completeness gaps.

Metrics produced:

domain_vsm_matrix: cell counts
coverage_ratio: scalar
vsm_balance_entropy: scalar
empty_cells: list of {domain, vsm_system} gaps
fca_gap_concepts: attribute combos with no entity
competency_coverage: fraction of questions answerable

CLI: --check-coverage --provider <provider>
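The matrix-based metrics from the first stage can be computed as below. This sketch assumes the metadata index has been flattened into (domain, vsm_system) pairs, one per mapping, and uses the natural logarithm for the entropy; both are assumptions, as is the function name.

```python
import math
from collections import Counter


def coverage_metrics(entities: list[tuple[str, str]],
                     domains: list[str],
                     vsm_systems: list[str]) -> dict:
    """entities: (domain, vsm_system) pairs; domains and systems non-empty."""
    cells = Counter(entities)
    total_cells = len(domains) * len(vsm_systems)
    empty = [(d, s) for d in domains for s in vsm_systems if cells[(d, s)] == 0]
    # Entropy over the share of mappings per VSM system (natural log).
    per_system = Counter(s for _, s in entities)
    n = sum(per_system.values())
    entropy = -sum((c / n) * math.log(c / n) for c in per_system.values())
    return {
        "coverage_ratio": (total_cells - len(empty)) / total_cells,
        "vsm_balance_entropy": entropy,
        "empty_cells": empty,
    }


metrics = coverage_metrics(
    [("Production", "S1"), ("Production", "S1"), ("Exchange", "S2"), ("Exchange", "S1")],
    domains=["Production", "Exchange"],
    vsm_systems=["S1", "S2"],
)
# coverage_ratio is 0.75; the single empty cell is ("Production", "S2")
```

The entropy is maximal (log of the number of systems) when entities are spread evenly across VSM systems, so a low value signals concentration in a few systems.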

16. Structural coherence (Concern C3) — OPEN

Depends on: Task 13 (metadata index). Methodology: OntoQA relationship richness + graph connectivity + community detection. See METRICS-METHODOLOGY.md §4 C3. Issue: It is unknown whether the 85 entities form a connected explanatory web or a fragmented collection. No relationship graph exists between entities. Suggested fix: Implement in three stages:

Explicit cross-references — Scan each entity's definition for mentions of other entity slugs or titles (normalised string matching). This is deterministic and catches direct references.
LLM-inferred edges — For entity pairs not caught by string matching but in the same domain or VSM system, LLM-Eval: "Does A's definition conceptually depend on or explain B, or vice versa?" Run in batches. Write the combined graph to output/metrics/relationship-graph.json (adjacency list).
Graph analysis — Using networkx or equivalent:
- Connected components (target: 1)
- Graph density, average degree
- Betweenness centrality → identify bridge concepts
- Louvain community detection → compare to declared domains
- OntoQA Relationship Richness
- Cohesion per domain, coupling across domains
- Orphan entities (degree 0 or 1)

Metrics produced:

connected_components: count (target: 1)
graph_density: scalar
avg_degree: scalar
relationship_richness: OntoQA RR
modularity: Louvain score
bridge_concepts: list (high betweenness centrality)
orphan_entities: list (degree ≤ 1)
cohesion_by_domain / coupling_across_domains: scalars

CLI: --check-coherence --provider <provider>
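A few of the graph metrics can be illustrated without third-party dependencies. The sketch below covers components, density, average degree, and orphans for an undirected graph; centrality and Louvain community detection are left to networkx or an equivalent library, and the function name is hypothetical.

```python
from collections import deque


def graph_metrics(nodes: list[str], edges: list[tuple[str, str]]) -> dict:
    """Basic coherence metrics for an undirected relationship graph."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        if a != b:  # ignore self-references
            adj[a].add(b)
            adj[b].add(a)

    # Count connected components with breadth-first search.
    seen, components = set(), 0
    for start in nodes:
        if start in seen:
            continue
        components += 1
        queue = deque([start])
        seen.add(start)
        while queue:
            for nxt in adj[queue.popleft()] - seen:
                seen.add(nxt)
                queue.append(nxt)

    n = len(nodes)
    edge_count = sum(len(v) for v in adj.values()) // 2
    return {
        "connected_components": components,
        "graph_density": 2 * edge_count / (n * (n - 1)) if n > 1 else 0.0,
        "avg_degree": 2 * edge_count / n if n else 0.0,
        "orphan_entities": sorted(x for x in nodes if len(adj[x]) <= 1),
    }


result = graph_metrics(["a", "b", "c", "d"], [("a", "b"), ("b", "c"), ("a", "c")])
# two components ({a, b, c} and {d}), density 0.5, orphan "d"
```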

17. Definitional consistency (Concern C4) — OPEN

Depends on: Task 16 (relationship graph — the definitional dependency graph is a directed variant of the same structure). Methodology: OntoClean metaproperties + OOPS! P24 (circular definitions) + SEQUAL validity. See METRICS-METHODOLOGY.md §4 C4. Issue: No mechanism to detect circular definitions, contradictions between related entities, or terms used in definitions that should be entities but aren't. Suggested fix: Implement in four stages:

Definitional dependency graph — Directed version of the relationship graph: edge A→B means A's definition uses B's concept. Reuse cross-reference extraction from task 16.
Cycle detection — Find all cycles of length ≤ 3 in the directed graph. Short cycles are problematic (A defines B, B defines A). Compute grounding_ratio: fraction of entities traceable to terms outside the entity set without encountering a cycle.
Undefined dependencies — Extract terms from definitions that match entity-name patterns (capitalised noun phrases, kebab-case slugs) but have no corresponding entity file. These are concepts the infospace implicitly relies on but hasn't defined.
LLM consistency checks — For directly-connected entity pairs, LLM-Eval: "Do these definitions contradict each other?" For entities with Smith's Original Wording, LLM-Eval: "Does the definition accurately represent the cited passage?"

Metrics produced:

circular_definitions: count and list of cycles (length ≤ 3)
grounding_ratio: fraction of entities reaching primitives
undefined_dependencies: list of missing terms
contradiction_candidates: LLM-flagged pairs
source_fidelity_score: fraction passing source check

CLI: --check-consistency --provider <provider>
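The cycle detection stage can be sketched as a bounded depth-first search. The input format (slug mapped to the set of slugs its definition uses) and the function name are assumptions; each cycle is reported once, starting from its alphabetically smallest slug.

```python
def short_cycles(deps: dict[str, set[str]], max_len: int = 3) -> list[tuple[str, ...]]:
    """Find directed cycles of length <= max_len.

    deps maps an entity slug to the slugs its definition uses (edge A -> B).
    """
    cycles = set()

    def walk(start: str, node: str, path: tuple[str, ...]) -> None:
        for nxt in deps.get(node, ()):
            if nxt == start:
                cycles.add(path)  # closed a cycle back to the start
            elif nxt > start and nxt not in path and len(path) < max_len:
                # Only visit slugs larger than start so each cycle is found once.
                walk(start, nxt, path + (nxt,))

    for start in deps:
        walk(start, start, (start,))
    return sorted(cycles)


deps = {
    "price": {"value"},
    "value": {"price", "labour"},
    "labour": set(),
    "a": {"b"}, "b": {"c"}, "c": {"d"}, "d": {"a"},
}
found = short_cycles(deps)  # the 4-cycle a -> b -> c -> d -> a is too long to report
```

Bounding the search depth keeps the cost manageable even on a dense graph, since paths longer than `max_len` are never extended.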

18. Granularity balance (Concern C5) — OPEN

Depends on: Task 13 (metadata index). Methodology: Keet granularity theory + OntoClean rigidity + DSL laconicity. See METRICS-METHODOLOGY.md §4 C5. Issue: Entities range from broad sectors (agriculture) to specific market roles (effectual-demanders) to abstract principles (division-of-labour). It is unclear whether this range is appropriate or whether some entities are too specific/general relative to their peers. Suggested fix: Implement in three stages:

LLM classification — For each entity, LLM-Eval assigns:
- Abstraction level: theory / mechanism / observation
- Scope score: 1-5 (very specific → very general)
- Indispensability: 1-5 ("if removed, how much explanatory power lost?")
Write to output/evaluations/<slug>-classification.yaml.
Distribution analysis — Deterministic:
- Count per abstraction level; compute entropy
- Per-domain scope variance (flag domains with high variance)
- Level × domain matrix (from FCA context in task 15)
- Outlier detection: entities > 1.5σ from their domain's mean scope
Merge/split recommendations — For outlier entities, LLM-Eval: "Should this entity be merged into a broader concept, split into sub-concepts, or is its current granularity justified?" For entities with indispensability ≤ 2: "Could another entity serve this purpose?"

Metrics produced:

abstraction_distribution: {theory: n, mechanism: n, observation: n}
abstraction_entropy: scalar (higher = more balanced)
scope_variance_by_domain: per-domain scalar
dispensable_entities: list (indispensability ≤ 2)
merge_candidates: list of pairs
split_candidates: list of entities

CLI: --check-granularity --provider <provider>
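The outlier rule from the distribution analysis can be expressed directly with the standard library. This sketch assumes the LLM classification has been collected into a mapping of slug to (domain, scope score) and uses the population standard deviation; whether population or sample deviation is intended is not stated, so treat that choice as an assumption.

```python
from collections import defaultdict
from statistics import mean, pstdev


def scope_outliers(scopes: dict[str, tuple[str, int]], k: float = 1.5) -> list[str]:
    """scopes maps slug -> (domain, scope score 1-5).

    Returns slugs whose scope lies more than k population standard
    deviations from the mean scope of their domain.
    """
    by_domain = defaultdict(list)
    for slug, (domain, score) in scopes.items():
        by_domain[domain].append((slug, score))

    outliers = []
    for members in by_domain.values():
        values = [s for _, s in members]
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            continue  # all scores identical: nothing to flag
        outliers += [slug for slug, s in members if abs(s - mu) > k * sigma]
    return sorted(outliers)


example = {
    "porter": ("Labour", 3), "weaver": ("Labour", 3), "smith": ("Labour", 3),
    "tanner": ("Labour", 3), "miner": ("Labour", 3), "agriculture": ("Labour", 5),
}
flagged_scope = scope_outliers(example)  # ["agriculture"]
```

Small domains make the standard deviation unstable, so a minimum domain size before flagging may be worth adding in practice.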

19. Unified collection evaluation command — OPEN

Depends on: Tasks 13-18. Issue: Running five separate --check-* commands is cumbersome and repeats shared computation (metadata parsing, embedding, graph building). Suggested fix: Add --evaluate-collection --provider <provider> that runs all five checks in sequence, sharing infrastructure:

Parse entity metadata index (task 13) — used by all
Compute embeddings (task 14) — used by C1, C3
Build relationship graph (task 16) — used by C3, C4
Run all five concern checks
Write per-concern reports to output/metrics/
Write unified metrics.yaml with all collection metrics
Append to metrics-history.yaml (task 12)

Incremental mode: --evaluate-collection --chapter <id> re-evaluates only entities from that chapter plus pairwise checks involving them.

Report a summary to stdout:

Collection evaluation (85 entities, 7 chapters):
  Redundancy:   3 synonym candidates, conciseness 0.96
  Coverage:     58% VSM, 20% chapters, 4 domain gaps
  Coherence:    1 component, density 0.12, 2 orphans
  Consistency:  0 cycles, 5 undefined deps, 0 contradictions
  Granularity:  entropy 1.42, 1 dispensable, 2 merge candidates
