From 4ce856d4d0e9fdce9ebdaacc54bec1d6b9822c6d Mon Sep 17 00:00:00 2001 From: tegwick Date: Wed, 18 Feb 2026 23:53:21 +0100 Subject: [PATCH] docs: metrics methodology, collection-level tasks, and infospace tooling roadmap Add METRICS-METHODOLOGY.md documenting the theoretical frameworks (SEQUAL, OntoClean, OOPS!, OntoQA, FCA, DSL principles) adapted for two-layer evaluation (LLM-Eval + deterministic aggregation) across five collection concerns: redundancy, coverage, coherence, consistency, and granularity balance. Extend INFRA-TASKS.md with assignment assessment (tasks 4-7), per-concept metrics (tasks 8-12), and collection-level metrics (tasks 13-19). Add roadmap/infospace-tooling/PLAN.md defining terminology (infospace, topic, discipline, entity, evaluation, viability) and a three-stage implementation plan: Stage 1 platform additions, Stage 2 infospace tooling layer, Stage 3 example revision. Co-Authored-By: Claude Opus 4.6 --- .../infospace-with-history/INFRA-TASKS.md | 510 ++++++++++++++ .../METRICS-METHODOLOGY.md | 501 ++++++++++++++ roadmap/infospace-tooling/PLAN.md | 621 ++++++++++++++++++ 3 files changed, 1632 insertions(+) create mode 100644 examples/infospace-with-history/METRICS-METHODOLOGY.md create mode 100644 roadmap/infospace-tooling/PLAN.md diff --git a/examples/infospace-with-history/INFRA-TASKS.md b/examples/infospace-with-history/INFRA-TASKS.md index 2febf231..2d848b9e 100644 --- a/examples/infospace-with-history/INFRA-TASKS.md +++ b/examples/infospace-with-history/INFRA-TASKS.md @@ -37,3 +37,513 @@ no automatic parsing for this format, requiring manual macro construction. **Fix applied:** Added `SHORTHAND_PATTERN` to `MacroParser` that recognises `@{target}` and maps it to `MacroKind.REQUIRED`. Updated `has_macros()`, `count_macros()`, and `find_macro_positions()` accordingly. 
+ +--- + +## Assignment Assessment (18 Feb 2026) + +How the example measures against the objectives stated in `README.md`: + +| # | Objective | Status | Notes | +|---|-----------|--------|-------| +| 1 | Capture knowledge from Wealth of Nations | **Partial** | 7 of 35 chapters processed (Book I, ch. 1-7). 85 canonical entities extracted. | +| 2 | Transform to VSM concepts/entities | **Done (for processed chapters)** | Entities mapped to S1-S5 with strength ratings. | +| 3 | Consistent and complete | **Not yet** | Only 20% of chapters done. Metrics report exists but covers limited scope. | +| 4 | Schemas as scaffolding | **Done** | Four schemas defined and used across all stages. | +| 5 | Prompt dependency resolution | **Done** | `@{macro}` templates resolved via MultiSpaceResolutionStrategy. | +| 6 | Incremental chapter injection | **Done** | Pipeline processes one chapter at a time; `@{existing_entities}` prevents duplication. | +| 7 | Keep changes as git history | **Not done** | See task 4 below. | +| 8 | Metrics for completeness/consistency | **Partial** | Template and report exist but only cover 4 chapters (report predates ch. 5-7). | +| 9 | No infrastructure changes during experiment | **Violated** | Three infra fixes were required (tasks 1-3 above). Documented as intended. | +| 10 | Generate task list for infra issues | **Done** | This file. | + +## 4. Infospace has no per-chapter git history — OPEN + +**Objective:** README states "The information space should utilize the option +of keeping changes as git history." +**Issue:** The 7 processed chapters were committed in mixed batches alongside +infrastructure changes (LLM adapters, entity refactoring, archive policy). +Chapters 1-2 are bundled into `fecc2fd` with the entire LLM module. +Chapters 5-7 share a single commit (`41773f1`) with the OpenAI adapter and +archive policy. There is no commit where you can `git diff` to see exactly +what one chapter contributed to the infospace. 
+**Impact:** Cannot use `git log`, `git diff`, or `git bisect` to trace how +the infospace grew chapter by chapter — the core promise of "with history." +**Suggested fix:** Re-run the 7 processed chapters (and remaining 28) using +`process_chapters.py` without `--no-commit`, on a clean branch or after +squashing the current output into a baseline commit. Each chapter gets its +own commit via `_git_commit_chapter()`. + +## 5. Prompt files are regenerated as a side-effect of DB rebuild — OPEN + +**Issue:** Running `--all --no-commit` to regenerate `infospace.db` also +overwrites `*-prompt.md` files in the output directories because each +pipeline stage unconditionally writes the compiled prompt before checking +whether output already exists. The `@{existing_entities}` macro content +shifts as earlier chapters are loaded, so prompt files for already-processed +chapters change on every full run. +**Impact:** A DB regeneration dirties the working tree with prompt file +changes, even though no actual outputs changed. Users must `git checkout` +the prompt files after regeneration. +**Suggested fix:** Skip writing prompt files when the corresponding output +file already exists on disk, or add a `--rebuild-db-only` flag that +populates the database without touching the file system. + +## 6. Metrics report is stale — OPEN + +**Issue:** The metrics report (`output/metrics/metrics-report.md`) was +generated after chapters 1-4. Chapters 5-7 have since been processed but +the report has not been refreshed. +**Impact:** The metrics do not reflect the current state of the infospace. +**Suggested fix:** Re-run `--metrics --provider --no-commit` +after every batch of new chapters. Consider making metrics assessment +automatic at the end of `--book` or `--all` runs. + +## 7. Remaining 28 chapters not yet processed — OPEN + +**Issue:** Only Book I chapters 1-7 have been processed. Books II-V +(28 chapters) remain unprocessed. 
+**Impact:** The infospace is incomplete — VSM coverage is limited to S1, +S2, and partial S4. S3, S3*, S5, and many systemic concepts (algedonic +signals, recursion, variety) are expected to emerge from later books. +**Suggested fix:** Process remaining chapters in book-sized batches with +per-chapter commits, refreshing metrics after each book. + +--- + +## Per-Concept Metrics (tasks 8-12) + +The current metrics system is a single LLM-evaluated narrative report that +assesses the infospace as a whole. It produces no machine-readable output, +cannot be tracked over time, and conflates per-concept quality with +collection-level coherence. + +The improvement splits metrics into two layers: + +- **LLM-Eval**: A prompt template evaluates each concept individually + against quality criteria defined in the schema. The LLM returns structured + scores, not prose. +- **Deterministic aggregation**: `process_chapters.py` computes what it can + from files on disk (schema compliance, word counts, section presence, + coverage tallies) and aggregates LLM-eval scores into dashboard metrics. + +Both layers persist results in structured form so they can be diffed, +tracked over time, and committed alongside the entities they evaluate. + +## 8. Add per-concept quality metrics to entity schema — OPEN + +**Issue:** The entity schema (`economic-entity-schema-v1.0.md`) defines +required sections and validation rules (section presence, word count range) +but no quality criteria. There is no definition of what makes a *good* +entity versus a merely *compliant* one. +**Suggested fix:** Add a `## Quality Metrics` section to the entity schema +defining evaluation dimensions with scoring rubrics: + +- **Definition Precision** (1-5): Is the definition specific, non-circular, + and distinguishable from neighbouring concepts? +- **Source Grounding** (1-5): Is the entity grounded in a specific passage? + Does the citation exist and support the definition? 
+- **Domain Placement** (1-5): Is the economic domain assignment correct and + specific (not just "General Theory")? +- **VSM Relevance** (1-5): Does the entity connect meaningfully to at least + one VSM system, or is it too granular/abstract to map? +- **Explanatory Value** (1-5): Does this entity contribute to explaining + the economic system, or is it a restatement of another concept? + +Similarly update the VSM mapping schema with: + +- **Rationale Rigour** (1-5): Is the mapping justified with reference to + Beer's definitions, not just surface-level analogy? +- **Strength Calibration** (1-5): Is the declared strength (Strong/Moderate/ + Weak) consistent with the rationale given? + +These rubrics become the prompt instructions for task 9. + +## 9. Create evaluate-entity prompt template — OPEN + +**Depends on:** Task 8 (quality metrics in schema). +**Issue:** There is no mechanism to evaluate an existing entity after +extraction. Quality is only judged implicitly during the global metrics +assessment, which is too coarse to identify individual weak entities. +**Suggested fix:** Create `templates/evaluate-entity.md` — a prompt +template that: + +1. Takes `@{entity_content}`, `@{source_chapter}`, `@{vsm_framework}`, + and `@{quality_rubric}` (from the schema's quality metrics section). +2. Asks the LLM to score each dimension (1-5) with a one-sentence + justification per score. +3. Outputs structured YAML front-matter (scores) followed by markdown + (justifications), e.g.: + +```yaml +--- +entity: division-of-labour +scores: + definition_precision: 5 + source_grounding: 5 + domain_placement: 4 + vsm_relevance: 5 + explanatory_value: 5 +overall: 4.8 +flags: [] +--- +``` + +Add a pipeline stage: `--evaluate` runs this template against every +canonical entity and writes results to `output/evaluations/-eval.md`. +A `--evaluate --chapter ` variant evaluates only entities introduced +by that chapter. + +## 10. 
Add deterministic schema compliance checker — OPEN + +**Issue:** Schema compliance is currently LLM-evaluated ("100%" in the +metrics report) but the validation rules in the schemas are mechanical: +section presence, word count ranges, heading format. These should be +checked programmatically, not by an LLM. +**Suggested fix:** Add a `validate_entity(path) -> ValidationResult` +function to `process_chapters.py` (or a new `validate.py` module) that: + +- Parses the markdown to extract H2 section headings +- Checks required sections are present (Definition, Source Chapter, + Context, Economic Domain) +- Counts words in the Definition section (must be 20-150) +- Checks H1 heading exists and is not a slug (e.g. `effectual-demand` + in chapter 7 has `# effectual-demand` instead of `# Effectual Demand`) +- Validates Source Chapter cites a specific book/chapter +- For mapping files: checks Mapping Strength is one of the enum values + +Expose as `--validate` CLI flag. Output a structured report: + +``` +Validation: 85 entities, 3 warnings + effectual-demand.md: H1 is slug format, not title case + porter.md: Definition is 18 words (minimum 20) + ... +``` + +This is fully deterministic — no LLM calls needed. + +## 11. Structured metrics output format — OPEN + +**Depends on:** Tasks 9 and 10. +**Issue:** The metrics report is a markdown narrative. Values cannot be +parsed programmatically, diffed meaningfully, or plotted over time. 
+**Suggested fix:** Alongside the human-readable `metrics-report.md`, +emit a machine-readable `metrics.yaml` (or `.json`) containing: + +```yaml +timestamp: "2026-02-18T12:00:00Z" +chapters_processed: 7 +chapters_total: 35 +entities_total: 85 +entities_archived: 0 +vsm_coverage: + S1: 28 + S2: 12 + S3: 8 + S3_star: 0 + S4: 5 + S5: 0 + recursion: 1 + variety: 0 +mapping_strength: + strong: 64 + moderate: 18 + weak: 3 +validation: + schema_compliant: 82 + warnings: 3 +evaluation: # from LLM-eval (task 9) + mean_overall: 4.2 + min_overall: 2.8 + flagged_entities: ["porter", "country-workman"] +``` + +The `--metrics` command writes both files. The YAML file is committed +to git so `git diff` shows exactly how metrics changed between runs. + +## 12. Metrics-over-time tracking — OPEN + +**Depends on:** Task 11 (structured output). +**Issue:** There is one metrics snapshot that gets overwritten. No history +of how metrics evolved as chapters were added. +**Suggested fix:** Append each metrics snapshot to a cumulative log file +`output/metrics/metrics-history.yaml` (list of timestamped entries). This +is committed to git alongside the current snapshot. The pipeline can +optionally render a simple text-based progress summary: + +``` +Metrics history (5 snapshots): + 2026-02-10 ch 1/35 13 entities 41.7% VSM coverage + 2026-02-11 ch 4/35 38 entities 50.0% VSM coverage + 2026-02-11 ch 7/35 85 entities 58.3% VSM coverage + ... +``` + +This provides the "metrics that improve over time" feedback loop the +README envisions: process chapters → evaluate → see coverage grow (or +flag regressions when a re-extraction reduces quality scores). + +--- + +## Collection-Level Metrics (tasks 13-19) + +These tasks implement the five collection-level concerns described in +`METRICS-METHODOLOGY.md`. They share underlying infrastructure (entity +metadata index, definition embeddings, relationship graph) that should +be built once per evaluation run. 
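One way the "built once per evaluation run" idea could be realised is a small run object that constructs shared artefacts lazily and caches them. The sketch below is an illustration only: `EvaluationRun`, `check_redundancy`, `check_coverage`, and the `entities/` paths are hypothetical names, and the metadata parsing is replaced by a stand-in.

```python
from functools import cached_property
from pathlib import PurePosixPath


class EvaluationRun:
    """Hypothetical container for artefacts shared across checks in one run."""

    def __init__(self, entity_paths):
        self.entity_paths = list(entity_paths)
        self.index_builds = 0  # counter kept only to demonstrate build-once behaviour

    @cached_property
    def metadata_index(self):
        # Stand-in for real markdown parsing: map each slug to its file path.
        self.index_builds += 1
        return {PurePosixPath(p).stem: {"path": p} for p in self.entity_paths}


def check_redundancy(run):
    # Placeholder check: only reports how many entities it would compare.
    return len(run.metadata_index)


def check_coverage(run):
    # Placeholder check: lists the slugs it would place in a coverage matrix.
    return sorted(run.metadata_index)


run = EvaluationRun(["entities/market-price.md", "entities/porter.md"])
print(check_redundancy(run), check_coverage(run), run.index_builds)
```

Both checks read `run.metadata_index`, but the index is constructed only on first access. Embeddings and the relationship graph could follow the same pattern if this design were adopted.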
+ +See the methodology document for theoretical grounding, framework +references, and the full metric definitions per concern. + +## 13. Entity metadata index — deterministic parsing layer — OPEN + +**Depends on:** Task 10 (schema compliance checker shares parsing logic). +**Issue:** Several collection-level metrics (coverage matrix, FCA context, +granularity distribution) require structured metadata extracted from entity +files: H1 title, economic domain, VSM system(s), source chapter, section +presence, word counts. Currently this information exists only as prose +inside markdown files. +**Suggested fix:** Add a `parse_entity_metadata(path) -> EntityMeta` +function that extracts from each entity file: + +```python +@dataclass +class EntityMeta: + slug: str + title: str # from H1 + domain: str # from Economic Domain section + source_chapter: str # from Source Chapter section + definition_words: int # word count of Definition section + has_original_wording: bool # optional section present? + has_modern_interpretation: bool + vsm_systems: list[str] # from mapping file if exists + mapping_strengths: list[str] +``` + +Build an index of all entities at the start of each evaluation run. +This index is the input for tasks 14, 16, and 18. Expose as +`--index` CLI flag for inspection. + +## 14. Redundancy detection (Concern C1) — OPEN + +**Depends on:** Task 13 (metadata index). +**Methodology:** OOPS! P2 (synonymous classes) + embedding similarity + +LLM pairwise judgment. See METRICS-METHODOLOGY.md §4 C1. +**Issue:** Entities with different slugs but overlapping meanings (e.g. +`natural-rate` / `ordinary-or-average-rate`) survive extraction because +dedup only checks slug collisions. There is no semantic overlap detection. +**Suggested fix:** Implement in three stages: + +1. **Embed** — Compute vector embeddings of all entity definitions using + an embedding API (OpenRouter, OpenAI, or a local sentence-transformer). 
+ Cache embeddings in `output/metrics/embeddings.json` keyed by + `{slug: content_digest}` so unchanged entities skip re-embedding. + +2. **Similarity matrix** — Compute NxN cosine similarity. Write the full + matrix to `output/metrics/similarity-matrix.json`. Flag all pairs with + cosine > 0.80 as candidates. + +3. **LLM pairwise judgment** — For each candidate pair, run a prompt: + "Given these two entity definitions, are they (a) the same concept and + should be merged, (b) genuinely distinct, or (c) partially overlapping + and should be clarified?" Write results to + `output/metrics/redundancy-report.md` + YAML. + +**Metrics produced:** +- `high_similarity_pairs`: count and list +- `confirmed_synonyms`: count (LLM-confirmed same concept) +- `redundancy_ratio`: `confirmed_synonyms / total_entities` +- `intensional_conciseness`: `1 - redundancy_ratio` + +**CLI:** `--check-redundancy --provider ` + +## 15. Coverage completeness (Concern C2) — OPEN + +**Depends on:** Task 13 (metadata index). +**Methodology:** SEQUAL completeness + FCA gap analysis + DSL competency +questions. See METRICS-METHODOLOGY.md §4 C2. +**Issue:** Coverage is currently assessed by the LLM in a single narrative +pass. There is no structured view of which domain × VSM cells are +populated, and no way to test whether the entity set can answer specific +questions about the economic system. +**Suggested fix:** Implement in three stages: + +1. **Domain × VSM matrix** — From the metadata index, count entities per + {economic_domain, vsm_system} cell. Render as a table. Identify empty + cells as specific, actionable gaps. Compute: + - `coverage_ratio = populated_cells / total_cells` + - `vsm_balance_entropy = -Σ(pᵢ log pᵢ)` across VSM systems + +2. **FCA lattice** — Construct a formal context with objects = entities, + attributes = {domain, vsm_system, source_book, abstraction_level}. + Compute the concept lattice (Python `concepts` library). 
Extract + attribute combinations with no corresponding entity — these are + **structural coverage gaps** not visible in the simple matrix. + +3. **Competency questions** — Define a set of 15-20 canonical questions + the infospace should answer (stored in + `schemas/competency-questions.md`). Example questions: + - "How does the division of labour relate to market extent?" + - "What mechanisms regulate wages toward their natural rate?" + - "How do monopolies distort the viable system?" + LLM-Eval tests whether current entities suffice to answer each. + Unanswerable questions identify specific completeness gaps. + +**Metrics produced:** +- `domain_vsm_matrix`: cell counts +- `coverage_ratio`: scalar +- `vsm_balance_entropy`: scalar +- `empty_cells`: list of {domain, vsm_system} gaps +- `fca_gap_concepts`: attribute combos with no entity +- `competency_coverage`: fraction of questions answerable + +**CLI:** `--check-coverage --provider ` + +## 16. Structural coherence (Concern C3) — OPEN + +**Depends on:** Task 13 (metadata index). +**Methodology:** OntoQA relationship richness + graph connectivity + +community detection. See METRICS-METHODOLOGY.md §4 C3. +**Issue:** It is unknown whether the 85 entities form a connected +explanatory web or a fragmented collection. No relationship graph exists +between entities. +**Suggested fix:** Implement in three stages: + +1. **Explicit cross-references** — Scan each entity's definition for + mentions of other entity slugs or titles (normalised string matching). + This is deterministic and catches direct references. + +2. **LLM-inferred edges** — For entity pairs not caught by string + matching but in the same domain or VSM system, LLM-Eval: "Does A's + definition conceptually depend on or explain B, or vice versa?" Run + in batches. Write the combined graph to + `output/metrics/relationship-graph.json` (adjacency list). + +3. 
**Graph analysis** — Using networkx or equivalent: + - Connected components (target: 1) + - Graph density, average degree + - Betweenness centrality → identify bridge concepts + - Louvain community detection → compare to declared domains + - OntoQA Relationship Richness + - Cohesion per domain, coupling across domains + - Orphan entities (degree 0 or 1) + +**Metrics produced:** +- `connected_components`: count (target: 1) +- `graph_density`: scalar +- `avg_degree`: scalar +- `relationship_richness`: OntoQA RR +- `modularity`: Louvain score +- `bridge_concepts`: list (high betweenness centrality) +- `orphan_entities`: list (degree ≤ 1) +- `cohesion_by_domain` / `coupling_across_domains`: scalars + +**CLI:** `--check-coherence --provider ` + +## 17. Definitional consistency (Concern C4) — OPEN + +**Depends on:** Task 16 (relationship graph — the definitional dependency +graph is a directed variant of the same structure). +**Methodology:** OntoClean metaproperties + OOPS! P24 (circular +definitions) + SEQUAL validity. See METRICS-METHODOLOGY.md §4 C4. +**Issue:** No mechanism to detect circular definitions, contradictions +between related entities, or terms used in definitions that should be +entities but aren't. +**Suggested fix:** Implement in four stages: + +1. **Definitional dependency graph** — Directed version of the + relationship graph: edge A→B means A's definition uses B's concept. + Reuse cross-reference extraction from task 16. + +2. **Cycle detection** — Find all cycles of length ≤ 3 in the directed + graph. Short cycles are problematic (A defines B, B defines A). + Compute `grounding_ratio`: fraction of entities traceable to terms + outside the entity set without encountering a cycle. + +3. **Undefined dependencies** — Extract terms from definitions that match + entity-name patterns (capitalised noun phrases, kebab-case slugs) but + have no corresponding entity file. These are concepts the infospace + implicitly relies on but hasn't defined. + +4. 
**LLM consistency checks** — For directly-connected entity pairs, + LLM-Eval: "Do these definitions contradict each other?" For entities + with Smith's Original Wording, LLM-Eval: "Does the definition + accurately represent the cited passage?" + +**Metrics produced:** +- `circular_definitions`: count and list of cycles (length ≤ 3) +- `grounding_ratio`: fraction of entities reaching primitives +- `undefined_dependencies`: list of missing terms +- `contradiction_candidates`: LLM-flagged pairs +- `source_fidelity_score`: fraction passing source check + +**CLI:** `--check-consistency --provider ` + +## 18. Granularity balance (Concern C5) — OPEN + +**Depends on:** Task 13 (metadata index). +**Methodology:** Keet granularity theory + OntoClean rigidity + +DSL laconicity. See METRICS-METHODOLOGY.md §4 C5. +**Issue:** Entities range from broad sectors (`agriculture`) to specific +market roles (`effectual-demanders`) to abstract principles +(`division-of-labour`). It is unclear whether this range is appropriate +or whether some entities are too specific/general relative to their peers. +**Suggested fix:** Implement in three stages: + +1. **LLM classification** — For each entity, LLM-Eval assigns: + - Abstraction level: `theory` / `mechanism` / `observation` + - Scope score: 1-5 (very specific → very general) + - Indispensability: 1-5 ("if removed, how much explanatory power lost?") + Write to `output/evaluations/-classification.yaml`. + +2. **Distribution analysis** — Deterministic: + - Count per abstraction level; compute entropy + - Per-domain scope variance (flag domains with high variance) + - Level × domain matrix (from FCA context in task 15) + - Outlier detection: entities > 1.5σ from their domain's mean scope + +3. **Merge/split recommendations** — For outlier entities, LLM-Eval: + "Should this entity be merged into a broader concept, split into + sub-concepts, or is its current granularity justified?" 
For entities + with indispensability ≤ 2: "Could another entity serve this purpose?" + +**Metrics produced:** +- `abstraction_distribution`: {theory: n, mechanism: n, observation: n} +- `abstraction_entropy`: scalar (higher = more balanced) +- `scope_variance_by_domain`: per-domain scalar +- `dispensable_entities`: list (indispensability ≤ 2) +- `merge_candidates`: list of pairs +- `split_candidates`: list of entities + +**CLI:** `--check-granularity --provider ` + +## 19. Unified collection evaluation command — OPEN + +**Depends on:** Tasks 13-18. +**Issue:** Running five separate `--check-*` commands is cumbersome and +repeats shared computation (metadata parsing, embedding, graph building). +**Suggested fix:** Add `--evaluate-collection --provider ` that +runs all five checks in sequence, sharing infrastructure: + +1. Parse entity metadata index (task 13) — used by all +2. Compute embeddings (task 14) — used by C1, C3 +3. Build relationship graph (task 16) — used by C3, C4 +4. Run all five concern checks +5. Write per-concern reports to `output/metrics/` +6. Write unified `metrics.yaml` with all collection metrics +7. Append to `metrics-history.yaml` (task 12) + +Incremental mode: `--evaluate-collection --chapter ` re-evaluates +only entities from that chapter plus pairwise checks involving them. 
+ +Report a summary to stdout: + +``` +Collection evaluation (85 entities, 7 chapters): + Redundancy: 3 synonym candidates, conciseness 0.96 + Coverage: 58% VSM, 20% chapters, 4 domain gaps + Coherence: 1 component, density 0.12, 2 orphans + Consistency: 0 cycles, 5 undefined deps, 0 contradictions + Granularity: entropy 1.42, 1 dispensable, 2 merge candidates +``` diff --git a/examples/infospace-with-history/METRICS-METHODOLOGY.md b/examples/infospace-with-history/METRICS-METHODOLOGY.md new file mode 100644 index 00000000..1b89c27f --- /dev/null +++ b/examples/infospace-with-history/METRICS-METHODOLOGY.md @@ -0,0 +1,501 @@ +# Collection-Level Metrics Methodology + +How we evaluate the quality of the infospace as a **collection of +interrelated concepts**, beyond the quality of individual entities. + +This document describes the theoretical frameworks drawn from ontology +engineering, formal concept analysis, semiotic quality theory, and DSL +design — and how each is adapted to work within MarkiTect's two-layer +evaluation model (LLM-Eval + deterministic aggregation). + +--- + +## 1. The Two-Layer Model + +Every metric in this methodology decomposes into two layers: + +| Layer | What it does | How it runs | +|-------|-------------|-------------| +| **LLM-Eval** | Qualitative judgment: "Are these two concepts the same?", "Is this definition grounded in the source?" | Prompt template → LLM → structured YAML output | +| **Deterministic** | Quantitative aggregation: cosine similarity, graph connectivity, coverage counting, cycle detection | Python code in `process_chapters.py` or dedicated `metrics.py` | + +The LLM-Eval layer produces **per-entity** or **per-pair** structured +scores. The deterministic layer **aggregates** these into collection-level +metrics, persisted as machine-readable YAML alongside human-readable +markdown reports. 
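As a minimal sketch of the deterministic layer, the snippet below rolls per-entity LLM-Eval scores up into collection-level values. The input records mimic the YAML front-matter shape proposed in INFRA-TASKS task 9; the sample values, the `aggregate` function, the `weak-definition` flag, and the `flag_below` threshold are assumptions made for illustration.

```python
from statistics import mean

# Hypothetical per-entity results, shaped like the structured output the
# LLM-Eval layer is expected to return (values are illustrative only).
evaluations = [
    {"entity": "division-of-labour", "overall": 4.8, "flags": []},
    {"entity": "porter", "overall": 2.8, "flags": ["weak-definition"]},
    {"entity": "market-price", "overall": 4.4, "flags": []},
]


def aggregate(evals, flag_below=3.0):
    """Deterministically roll per-entity scores up into collection metrics."""
    overall = [e["overall"] for e in evals]
    return {
        "entities_evaluated": len(evals),
        "mean_overall": round(mean(overall), 2),
        "min_overall": min(overall),
        # An entity is flagged if the LLM raised a flag or its score is low.
        "flagged_entities": sorted(
            e["entity"] for e in evals if e["flags"] or e["overall"] < flag_below
        ),
    }


summary = aggregate(evaluations)
print(summary)
```

Because this step involves no LLM call, re-running it on the same evaluation files should always give the same numbers, which is what makes the resulting YAML diffable.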
+ +Per-concept quality metrics (definition precision, source grounding, VSM +relevance — see INFRA-TASKS 8-12) operate at the individual entity level. +This document covers the five **collection-level concerns** that assess how +the entities work together as an explanatory system. + +--- + +## 2. Five Collection-Level Concerns + +### Overview + +| # | Concern | Question | Primary framework | +|---|---------|----------|-------------------| +| C1 | Semantic Overlap | Are there redundant concepts? | OOPS! P2, embedding similarity | +| C2 | Coverage Completeness | Does the concept set cover the domain? | SEQUAL, FCA | +| C3 | Structural Coherence | Do concepts form a connected explanatory graph? | OntoQA, graph theory | +| C4 | Definitional Consistency | Are concepts defined consistently and non-circularly? | OntoClean, OOPS! P24 | +| C5 | Granularity Balance | Are concepts at comparable levels of abstraction? | Granularity theory, DSL laconicity | + +--- + +## 3. Theoretical Frameworks + +### 3.1 SEQUAL (Semiotic Quality Framework) + +**Origin:** Lindland, Sindre & Sølvberg (1994), extended by Krogstie et al. + +**What it defines:** Quality of a conceptual model as the correspondence +between three worlds — the domain (what exists), the model (what we +captured), and the audience's interpretation (what they understand). + +Two key dimensions of **semantic quality**: + +- **Validity** — everything in the model corresponds to something real + in the domain. No invented concepts. +- **Completeness** — everything relevant in the domain is represented in + the model. No missing concepts. + +**How we use it:** SEQUAL frames our entire metrics approach. 
Every +collection-level metric maps to one of these dimensions: + +| SEQUAL dimension | Our concerns | +|-----------------|--------------| +| Validity | C1 (redundancy reduces validity — duplicate concepts don't correspond to distinct domain facts), C4 (consistency — contradictory definitions can't both be valid) | +| Completeness | C2 (coverage — are all needed concepts present?), C5 (granularity — missing levels of abstraction are completeness gaps) | +| Both | C3 (coherence — disconnected concepts suggest either missing bridging concepts [completeness] or misplaced concepts [validity]) | + +**Adaptation:** SEQUAL was designed for formal models evaluated by human +experts. We replace human judgment with LLM-Eval (for validity checks like +"does this concept correspond to something Smith actually described?") and +deterministic counting (for completeness checks like "which VSM systems +lack entity mappings?"). + +### 3.2 OntoClean + +**Origin:** Guarino & Welty (2004). + +**What it defines:** A methodology for validating taxonomic relationships +by assigning **metaproperties** to each concept: + +- **Rigidity** — Is the property essential to all its instances? (e.g. + "market" is rigid; "effectual demander" is anti-rigid — an agent can + stop being an effectual demander) +- **Identity** — Does the concept carry an identity criterion? (e.g. + "division of labour" can be identified by its three causal mechanisms) +- **Unity** — Are all instances of this concept whole in the same way? +- **Dependence** — Does the concept require another concept to exist? + (e.g. "market price" depends on "effectual demand") + +**Constraint:** A rigid concept cannot be subsumed by an anti-rigid one. +Violations indicate structural confusion. + +**How we use it:** We do not have a formal taxonomy, but our flat entity +set implicitly contains subsumption relationships (e.g. "natural rate" +subsumes "ordinary-or-average rate"). 
OntoClean metaproperties help detect: + +- **Granularity mismatches** (C5): A rigid concept at the same level as + an anti-rigid one suggests different abstraction levels are mixed. +- **Definitional consistency** (C4): If entity A depends on entity B per + OntoClean, but B's definition doesn't acknowledge A, the definitions + are inconsistent. +- **Redundancy** (C1): Two entities with identical metaproperty profiles + and overlapping definitions are candidates for merging. + +**Adaptation:** Instead of manual metaproperty assignment, we use LLM-Eval +to classify each entity's rigidity, identity criterion, and dependencies. +The constraint checking is then deterministic. + +### 3.3 OOPS! (Ontology Pitfall Scanner) + +**Origin:** Poveda-Villalón et al. (2014). Catalogue of 41 common +ontology design pitfalls. + +**What it defines:** Concrete, testable anti-patterns. The pitfalls most +relevant to our infospace: + +| Pitfall | Description | Our concern | +|---------|-------------|-------------| +| P2 | Synonymous classes — different names, same meaning | C1 (redundancy) | +| P4 | Unconnected ontology elements | C3 (coherence) | +| P7 | Merging different concepts in the same class | C5 (granularity — too coarse) | +| P10 | Missing disjointness axioms | C1 (how do we know two concepts don't overlap?) | +| P11 | Missing domain or range | C4 (consistency) | +| P13 | Missing inverse relationships | C3 | +| P24 | Recursive/circular definition | C4 (consistency) | +| P25 | Inverse of itself | C4 | + +**How we use it:** OOPS! pitfalls become a **checklist for LLM-Eval +prompts**. Rather than running a formal OWL scanner, we ask the LLM to +check for each pitfall pattern: + +- "Are entities A and B synonymous?" (P2) +- "Does entity A's definition reference itself?" (P24) +- "Is entity A actually two distinct concepts merged together?" (P7) + +The deterministic layer counts pitfall occurrences and tracks them over +time. + +**Adaptation:** We select the subset of OOPS! 
pitfalls applicable to +semi-formal markdown-based ontologies (no OWL axioms) and implement each +as an LLM-Eval prompt pattern rather than a formal reasoner check. + +### 3.4 OntoQA (Metric-Based Ontology Quality Analysis) + +**Origin:** Tartir & Arpinar (2007). + +**What it defines:** Quantitative schema-level and instance-level metrics: + +- **Relationship Richness (RR):** Proportion of non-taxonomic (lateral) + relationships to total relationships. `RR = non_hierarchical / total`. + Low RR = mere taxonomy. High RR = rich cross-cutting connections. +- **Attribute Richness (AR):** Average number of attributes per concept. + `AR = total_attributes / total_concepts`. +- **Inheritance Richness (IR):** Average subclasses per class — measures + how knowledge distributes across the hierarchy. +- **Class Richness (CR):** Proportion of classes with instances. + +**How we use it:** Our entities don't have formal relationships declared +between them, but we can **infer** a relationship graph from their +definitions and mappings: + +- Entity A references entity B in its definition → definitional dependency +- Entities A and B map to the same VSM system → structural co-occurrence +- Entities A and B appear in the same chapter → contextual co-occurrence + +From this inferred graph, we compute OntoQA metrics directly: + +- **Relationship Richness** tells us whether our concepts form a web of + explanatory connections or just a flat list. +- **Attribute Richness** maps to our schema sections — entities with more + optional sections filled (Original Wording, Modern Interpretation) are + richer. + +**Adaptation:** The key modification is that relationship inference is an +LLM-Eval step (pairwise: "does A's definition depend on or reference B?"), +after which all OntoQA metrics are computed deterministically on the +resulting graph. + +### 3.5 Formal Concept Analysis (FCA) + +**Origin:** Wille (1982). Applied to ontology auditing by Elhaj et al. 
+(2008) for SNOMED CT completeness checking. + +**What it defines:** A mathematical framework for deriving a **concept +lattice** from a binary relation between objects and attributes. The +lattice reveals: + +- **Formal concepts**: maximal sets of objects sharing the same attributes +- **Subconcept/superconcept** relationships: the natural hierarchy +- **Missing concepts**: attribute combinations with no corresponding object + +**How we use it:** We construct a **formal context** (binary matrix): + +- **Objects** = our 85 entities +- **Attributes** = economic domain, VSM system, source book, abstraction + level (from LLM-Eval), key terms (extracted from definitions) + +The concept lattice then reveals: + +- **Coverage gaps** (C2): Attribute combinations with no entity. E.g. if + the cell {Distribution, S3} is empty, we lack control-layer concepts + for distribution — a specific, actionable gap. +- **Redundancy** (C1): Entities with identical attribute sets (same formal + concept) are candidates for merging. +- **Granularity** (C5): The lattice depth indicates how many meaningful + levels of abstraction exist. A shallow lattice suggests missing + intermediate concepts. + +**Adaptation:** Classic FCA requires crisp binary attributes. Our domains +and VSM mappings are already categorical, but abstraction level and key +terms need LLM-Eval to produce. The lattice computation itself is +deterministic (Python `concepts` library or equivalent). The FCA approach +replaces the current "ask the LLM about coverage" with a structural +computation that can identify *specific* gaps rather than vague +recommendations. + +### 3.6 DSL Design Principles + +**Origin:** Mernik et al. (2005) "When and How to Develop DSLs"; +Karsai et al. (2014) "Design Guidelines for Domain-Specific Languages". 
+ +**What they define:** Quality criteria for a set of concepts that form a +language for a specific domain: + +- **Soundness**: Every concept in the language corresponds to a real domain + concern (no invented abstractions). +- **Completeness**: The language can express everything needed for its + intended tasks. +- **Laconicity**: No unnecessary concepts — every concept earns its place. +- **Orthogonality**: Concepts are independent; combining any two produces + a meaningful result (no redundant combinations). + +**How we use it:** Our entity set is effectively a domain-specific +vocabulary for "explaining classical economics through VSM". DSL quality +criteria translate directly: + +- **Soundness** → Validity (SEQUAL): every entity grounded in Smith's text +- **Completeness** → Coverage (C2): can we answer the "competency + questions" the infospace is meant to address? +- **Laconicity** → Anti-redundancy (C1) + Indispensability (C5): would + removing any entity lose explanatory power? +- **Orthogonality** → Non-overlap (C1): entity definitions don't + substantially duplicate each other + +**Adaptation:** We operationalise DSL completeness through **competency +questions** — a set of canonical questions the infospace should be able to +answer (e.g. "How does the division of labour relate to market extent?", +"What mechanisms regulate wages toward their natural rate?"). LLM-Eval +tests whether the current entity set suffices to answer each question. +Unanswerable questions identify specific completeness gaps. + +Laconicity is operationalised as **indispensability scoring**: for each +entity, LLM-Eval rates whether removing it would lose explanatory power. +Low-scoring entities are candidates for merging or retirement. + +--- + +## 4. Integration: Metric Definitions by Concern + +### C1: Semantic Overlap / Redundancy + +**Goal:** Identify entities that substantially overlap in meaning and +should be merged, distinguished, or retired. 
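A candidate shortlist for this kind of overlap can be computed deterministically before any LLM judgment is requested. The following is a minimal sketch, assuming each entity definition has already been embedded into a plain list of floats keyed by slug; the function names are hypothetical, and the 0.80 cut-off mirrors the `high_similarity_pairs` metric defined for this concern.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def high_similarity_pairs(embeddings, threshold=0.80):
    """Slug pairs whose similarity exceeds `threshold`, most similar first."""
    scored = [
        (a, b, cosine(embeddings[a], embeddings[b]))
        for a, b in combinations(sorted(embeddings), 2)
    ]
    return sorted((p for p in scored if p[2] > threshold), key=lambda p: -p[2])
```

Only the pairs returned here would need a pairwise LLM call, which is what keeps the judgment step well below N² requests.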
+ +**Metrics:** + +| Metric | Type | Computation | +|--------|------|-------------| +| `similarity_matrix` | Deterministic | Embed all entity definitions; compute NxN cosine similarity | +| `high_similarity_pairs` | Deterministic | Pairs with cosine > 0.80, sorted descending | +| `confirmed_synonyms` | LLM-Eval | For each high-similarity pair, LLM judges: "same concept" / "genuinely distinct" / "partial overlap" | +| `redundancy_ratio` | Deterministic | `confirmed_synonyms / total_entities` | +| `intensional_conciseness` | Deterministic | `1 - redundancy_ratio` (from KG quality framework) | + +**Pipeline:** +1. Embed definitions (embedding API or local model) +2. Compute cosine similarity matrix +3. Filter pairs above threshold +4. LLM pairwise judgment on filtered pairs only (avoids N² LLM calls) +5. Aggregate into ratio and conciseness score + +**Output:** `output/metrics/redundancy-report.md` + structured YAML with +pair list, scores, and merge/retire recommendations. + +### C2: Coverage Completeness + +**Goal:** Identify domain areas and VSM systems that lack adequate +representation in the entity set. + +**Metrics:** + +| Metric | Type | Computation | +|--------|------|-------------| +| `domain_vsm_matrix` | Deterministic | Count entities per {economic_domain, VSM_system} cell | +| `coverage_ratio` | Deterministic | `populated_cells / expected_cells` | +| `vsm_balance_entropy` | Deterministic | Shannon entropy of entity distribution across VSM systems (higher = more balanced) | +| `empty_cells` | Deterministic | List of {domain, VSM_system} pairs with zero entities | +| `competency_coverage` | LLM-Eval | For each competency question, can it be answered with current entities? | +| `fca_gap_concepts` | Deterministic | Attribute combinations in the FCA lattice with no corresponding entity | + +**Pipeline:** +1. Parse entity metadata (domain, VSM mapping) from files on disk +2. Build domain × VSM matrix; identify empty cells +3. 
Build FCA formal context; compute lattice; extract gap concepts +4. Define competency questions (initially hand-written, later LLM-generated + from the source material) +5. LLM-evaluate answerability of each question +6. Aggregate into coverage ratio, entropy, and gap list + +**Output:** `output/metrics/coverage-report.md` + YAML with matrix, gaps, +and competency question results. + +### C3: Structural Coherence + +**Goal:** Determine whether the entities form a connected explanatory web +or a fragmented collection of isolated concepts. + +**Metrics:** + +| Metric | Type | Computation | +|--------|------|-------------| +| `relationship_graph` | LLM-Eval + Deterministic | Infer edges from definition cross-references (string matching) + LLM judgment for implicit references | +| `connected_components` | Deterministic | Number of connected components in the graph (target: 1) | +| `graph_density` | Deterministic | `actual_edges / possible_edges` | +| `avg_degree` | Deterministic | `total_edges / total_entities` | +| `relationship_richness` | Deterministic | OntoQA RR: `non_hierarchical_edges / total_edges` | +| `modularity` | Deterministic | Louvain modularity score (0.3-0.7 = meaningful structure; >0.8 = fragmentation) | +| `bridge_concepts` | Deterministic | Entities with highest betweenness centrality (connect clusters) | +| `orphan_entities` | Deterministic | Entities with degree 0 or 1 | +| `cohesion_by_domain` | Deterministic | Avg intra-domain edges per entity | +| `coupling_across_domains` | Deterministic | Inter-domain edges / total edges | + +**Pipeline:** +1. Extract explicit cross-references from definitions (entity name + mentions in other definitions — string matching with slug normalisation) +2. For entity pairs not caught by string matching, LLM-Eval: "Does A's + definition depend on or reference B's concept?" +3. Build directed graph +4. Compute graph metrics (networkx or equivalent) +5. 
Run community detection; compare detected communities to declared + economic domains + +**Output:** `output/metrics/coherence-report.md` + YAML with graph +statistics, orphan list, bridge concepts, and community structure. + +### C4: Definitional Consistency + +**Goal:** Ensure entities are defined consistently, non-circularly, and +without contradicting each other. + +**Metrics:** + +| Metric | Type | Computation | +|--------|------|-------------| +| `definitional_dependency_graph` | Deterministic + LLM-Eval | Edges where A's definition uses B's concept | +| `circular_definitions` | Deterministic | Cycles of length ≤ 3 in the dependency graph | +| `definition_depth` | Deterministic | Longest dependency chain per entity before reaching a term not in the entity set | +| `undefined_dependencies` | Deterministic | Terms used in definitions that arguably should be entities but aren't | +| `pairwise_consistency` | LLM-Eval | For related entity pairs (sharing edges): "Do these definitions contradict each other?" | +| `source_fidelity` | LLM-Eval | "Does this definition accurately represent what Smith wrote in the cited passage?" | +| `metaproperty_violations` | LLM-Eval + Deterministic | OntoClean constraint checking after LLM classifies rigidity/identity | +| `grounding_ratio` | Deterministic | Fraction of entities traceable to primitives without cycles | + +**Pipeline:** +1. Build definitional dependency graph (same technique as C3, but directed + — A depends on B means A's definition uses B, not vice versa) +2. Detect cycles; flag short cycles +3. Extract undefined terms (terms matching entity-name patterns that appear + in definitions but have no corresponding entity file) +4. LLM pairwise consistency check on directly-connected pairs +5. LLM source fidelity check (compare definition to source chapter text) +6. 
LLM OntoClean metaproperty classification; deterministic constraint + checking + +**Output:** `output/metrics/consistency-report.md` + YAML with cycle list, +undefined terms, contradiction candidates, and metaproperty violations. + +### C5: Granularity Balance + +**Goal:** Ensure entities operate at comparable levels of abstraction +within their respective domains and perspectives. + +**Metrics:** + +| Metric | Type | Computation | +|--------|------|-------------| +| `abstraction_classification` | LLM-Eval | Classify each entity as theory-level / mechanism-level / observation-level | +| `scope_score` | LLM-Eval | Rate each entity 1-5 for generality (1 = very specific instance, 5 = broad theoretical principle) | +| `abstraction_distribution` | Deterministic | Count per level; compute entropy | +| `scope_variance` | Deterministic | Variance of scope scores within each domain | +| `level_x_perspective_matrix` | Deterministic | Cross-tabulation of abstraction level × economic domain | +| `indispensability` | LLM-Eval | "If removed, what explanatory power is lost?" (1-5) | +| `dispensable_entities` | Deterministic | Entities with indispensability score ≤ 2 | +| `merge_candidates` | LLM-Eval | Pairs where one is a sub-case of the other | + +**Pipeline:** +1. LLM-classify each entity: abstraction level, scope score, + indispensability +2. Build level × perspective matrix +3. Compute distribution entropy and per-domain scope variance +4. Flag outliers: entities whose scope score deviates > 1.5σ from their + domain mean +5. For outlier entities, LLM-Eval: "Should this be merged into a broader + concept, or split into sub-concepts?" + +**Output:** `output/metrics/granularity-report.md` + YAML with +classifications, distribution, outliers, and merge/split recommendations. + +--- + +## 5. 
Shared Infrastructure + +Several concerns share underlying computations: + +| Infrastructure | Used by | Build once | +|---------------|---------|------------| +| Definition embeddings (vector per entity) | C1, C3 | Embedding API call per entity | +| Relationship graph (entity → entity edges) | C3, C4 | String matching + LLM-Eval | +| FCA formal context (entity × attribute matrix) | C2, C5 | Metadata parsing + LLM classification | +| Entity metadata index (domain, VSM, chapter, sections) | C2, C5, C10 (schema compliance) | Deterministic markdown parsing | + +These should be computed once per evaluation run and cached for use by +all concern-specific metrics. + +--- + +## 6. Evaluation Workflow + +A full collection-level evaluation run: + +``` +process_chapters.py --evaluate-collection --provider +``` + +1. **Parse** — deterministic metadata extraction from all entity files +2. **Embed** — compute definition embeddings (cached; only new/changed + entities need fresh embeddings) +3. **Infer** — LLM-Eval for relationship edges, metaproperties, + abstraction levels, pairwise judgments (batched to minimise LLM calls) +4. **Compute** — deterministic graph metrics, FCA lattice, coverage + matrix, similarity matrix, cycle detection +5. **Aggregate** — combine per-entity and per-pair scores into + collection-level metrics +6. **Report** — write per-concern markdown reports + unified `metrics.yaml` +7. **Append** — add timestamped snapshot to `metrics-history.yaml` + +Incremental mode (`--evaluate-collection --chapter `) re-evaluates +only the entities introduced or modified by that chapter, plus any +pairwise checks involving those entities. + +--- + +## 7. References + +- Lindland, O.I., Sindre, G. & Sølvberg, A. (1994). "Understanding + Quality in Conceptual Modeling." *IEEE Software* 11(2), 42-49. + → SEQUAL framework: validity and completeness dimensions. + +- Guarino, N. & Welty, C.A. (2004). "An Overview of OntoClean." In + *Handbook on Ontologies*, Springer, 151-171. 
+ → Metaproperty analysis: rigidity, identity, unity, dependence. + +- Poveda-Villalón, M., Gómez-Pérez, A. & Suárez-Figueroa, M.C. (2014). + "OOPS! (OntOlogy Pitfall Scanner!): An On-line Tool for Ontology + Evaluation." *IJSWIS* 10(2), 7-34. + → Pitfall catalogue: 41 anti-patterns for ontology design. + +- Tartir, S. & Arpinar, I.B. (2007). "Ontology Evaluation and Ranking + using OntoQA." *ICSC 2007*, IEEE, 185-192. + → Schema metrics: relationship richness, attribute richness. + +- Wille, R. (1982). "Restructuring Lattice Theory." In *Ordered Sets*, + Reidel, 445-470. + → Formal Concept Analysis: concept lattices from binary contexts. + +- Elhaj, H. et al. (2008). "Auditing SNOMED CT with Formal Concept + Analysis." *AMIA Annual Symposium*, PMC2605587. + → FCA for ontology completeness auditing. + +- Keet, C.M. (2008). *A Formal Theory of Granularity.* PhD thesis, + Free University of Bozen-Bolzano. + → Granularity levels and perspectives for ontology design. + +- Mernik, M., Heering, J. & Sloane, A.M. (2005). "When and How to + Develop Domain-Specific Languages." *ACM Computing Surveys* 37(4), + 316-344. + → DSL design: soundness, completeness, laconicity. + +- Karsai, G. et al. (2014). "Design Guidelines for Domain Specific + Languages." *arXiv:1409.2378*. + → Orthogonality, necessary-and-sufficient principle. + +- Xue, B. & Zou, L. (2022). "Knowledge Graph Quality Management: A + Comprehensive Survey." *IEEE TKDE* 35(5), 4969-4988. + → KG quality dimensions: conciseness, consistency, completeness. diff --git a/roadmap/infospace-tooling/PLAN.md b/roadmap/infospace-tooling/PLAN.md new file mode 100644 index 00000000..5253736a --- /dev/null +++ b/roadmap/infospace-tooling/PLAN.md @@ -0,0 +1,621 @@ +# Viable Infospace Tooling — Roadmap + +## Vision + +An **infospace** is a structured, evaluable, composable collection of +concepts that explains a **topic** through the lens of one or more +**disciplines**. Infospaces are the unit of knowledge work in MarkiTect. 
+ +This roadmap organises the work needed to move from the current +ad-hoc example (`infospace-with-history`) to a general-purpose platform +for creating, evaluating, maintaining, and composing infospaces. + +--- + +## Terminology + +These terms establish the vocabulary for infospace tooling. They +generalise from the Wealth of Nations / VSM example but are not +specific to it. + +### Infospace + +A curated, self-describing collection of **entities** (concepts, +mechanisms, observations) that together explain a **topic**. An +infospace has: + +- A **topic** — the subject matter being explained (e.g. "The Wealth + of Nations", "cellular biology", "Kubernetes networking") +- One or more **disciplines** — external frameworks applied as lenses + (e.g. "Viable System Model", "category theory") +- **Entities** — the atomic units of knowledge, each with a definition, + provenance, and quality scores +- **Schemas** — structural templates that define what a well-formed + entity, mapping, or analysis looks like +- **Evaluations** — per-entity and collection-level quality assessments +- **Metrics** — quantitative indicators of completeness, coherence, + consistency, and granularity balance + +An infospace is **viable** when it meets threshold scores across its +defined metrics — it is fit for purpose as an explanatory tool. + +### Topic + +The subject matter an infospace is built to explain. A topic sits +within a **domain** (broader field of knowledge) but is more specific: + +- Domain: Economics → Topic: The Wealth of Nations +- Domain: Systems Theory → Topic: Viable System Model +- Domain: Computer Science → Topic: Distributed consensus protocols + +A topic provides the **source material** — the texts, data, or +observations from which entities are extracted. + +### Discipline + +A reusable framework of concepts applied as a lens to explore a topic. +A discipline is itself an infospace — one that has been evaluated as +viable and packaged for reuse. 
+ +In our example, the VSM is the discipline: a set of concepts (S1-S5, +recursion, variety, viability) from systems theory, applied to the +economic concepts in Smith's work. + +**Key property:** Disciplines compose. An infospace built with one +discipline can itself become a discipline for another infospace. The +Wealth of Nations infospace, viewed through VSM, could become a +discipline applied to a modern supply chain analysis. + +### Entity + +The atomic unit of an infospace. An entity has: + +- **Identity**: a unique slug and human-readable title +- **Definition**: a precise, non-circular explanation +- **Provenance**: the source chapter, passage, and extraction context +- **Domain placement**: which area of the topic it belongs to +- **Discipline mapping**: how it connects to the applied discipline + (e.g. which VSM system) +- **Quality scores**: per-entity LLM-evaluated metrics +- **Lifecycle state**: active, archived (with reason), or draft + +### Evaluation + +A structured assessment of quality, applied at two levels: + +- **Per-entity evaluation**: scores an individual entity against + quality rubrics defined in its schema (definition precision, source + grounding, discipline relevance, etc.) +- **Collection evaluation**: scores the entity set as a whole against + five concerns: redundancy, coverage, coherence, consistency, and + granularity balance + +Evaluations are always performed by **delegated LLM calls** through +MarkiTect's LLM integration — never by the coding agent working on +infrastructure. This separation ensures that domain-level judgment +stays in the problem space, not the tooling space. + +### Viability + +An infospace is viable when: + +1. Its entities individually meet quality thresholds (per-entity eval) +2. Its collection metrics are within acceptable ranges +3. It can answer its defined **competency questions** — the canonical + queries the infospace is meant to support +4. 
It has been evaluated recently enough that metrics reflect current + content + +Viability is not binary — it is a profile of scores that the user +sets thresholds for based on their needs. + +--- + +## Architecture: Three Layers + +``` +┌──────────────────────────────────────────────────┐ +│ Layer 3: Infospace Instances │ +│ Specific infospaces built by users │ +│ (Wealth of Nations + VSM, supply chain + ...) │ +│ Works IN an infospace │ +├──────────────────────────────────────────────────┤ +│ Layer 2: Infospace Tooling │ +│ Terminology, primitives, composition model │ +│ CLI: infospace create/evaluate/compose/... │ +│ Works WITH infospaces │ +├──────────────────────────────────────────────────┤ +│ Layer 1: MarkiTect Platform │ +│ Artifacts, prompts, LLM, spaces, graph, embed │ +│ Provides FOR infospaces │ +└──────────────────────────────────────────────────┘ +``` + +### Boundary condition: LLM delegation + +All LLM-based evaluation (entity scoring, pairwise judgments, coverage +analysis) is delegated to MarkiTect's LLM integration module. The coding +agent that works on infrastructure never makes domain-level judgments +itself. This keeps a clean separation: + +- **Coding agent** → writes Python, templates, schemas, tests +- **MarkiTect LLM** → evaluates entities, judges redundancy, assesses + coverage, checks consistency + +The infospace tooling (Layer 2) orchestrates these LLM calls through +prompt templates and the prompt execution engine, not through ad-hoc +prompting. + +--- + +## Stage 1: MarkiTect Platform Additions + +Infrastructure that must exist before infospace tooling can be built. +These are general-purpose platform capabilities, not infospace-specific. + +### S1.1 — Entity metadata parser + +Add a deterministic markdown parser that extracts structured metadata +from entity files: H1 title, sections present, word counts, domain, +source chapter. Returns a dataclass usable by all downstream metrics. 
+ +**Maps to:** INFRA-TASKS #13, #10 +**Location:** `markitect/prompts/quality/` or new `markitect/analysis/` +**Depends on:** Nothing — can start immediately +**Deliverable:** `parse_entity_metadata(path) -> EntityMeta` function +with tests + +### S1.2 — Schema compliance validator + +Deterministic validation of entity/mapping files against their schemas: +section presence, word count ranges, heading format, enum values. No +LLM needed. + +**Maps to:** INFRA-TASKS #10 +**Location:** `markitect/prompts/quality/validator.py` (extend existing) +**Depends on:** S1.1 +**Deliverable:** `validate_document(path, schema) -> ValidationResult` +with tests + +### S1.3 — Embedding adapter + +Add embedding support to `markitect/llm/`. Needs: + +- `EmbeddingAdapter` interface: `embed(texts: list[str]) -> list[list[float]]` +- `OpenRouterEmbeddingAdapter` implementation (or OpenAI embedding endpoint) +- Caching layer: store embeddings keyed by `{slug: content_digest}` so + unchanged entities skip re-embedding +- Cosine similarity utility: `similarity_matrix(embeddings) -> np.ndarray` + +**Maps to:** INFRA-TASKS #14 (prerequisite) +**Location:** `markitect/llm/embeddings.py` +**Depends on:** Nothing — can start immediately +**Deliverable:** Embedding adapter + cache + similarity computation, with +tests + +### S1.4 — Graph analysis utilities + +The existing `DependencyGraph` supports basic traversal and cycle +detection. Collection-level metrics need richer analysis: + +- Connected components +- Betweenness centrality +- Community detection (Louvain or label propagation) +- Modularity score +- Degree distribution +- Cohesion/coupling computation + +Decide: extend `DependencyGraph` or add a lightweight wrapper that +converts to networkx (adding it as an optional dependency). 
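If the networkx route is deferred, a few of the listed analyses are small enough to prototype with the standard library. A minimal sketch, assuming the graph is treated as undirected for these three metrics and that every edge endpoint is a known entity slug; the function names are hypothetical and not part of the existing `DependencyGraph`.

```python
from collections import deque


def undirected_adjacency(nodes, edges):
    """Build slug -> set of neighbours, ignoring self-loops and edge direction."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def connected_components(adj):
    """Breadth-first search from every unvisited node."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        queue, component = deque([start]), {start}
        seen.add(start)
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    component.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return components


def graph_summary(nodes, edges):
    """Component count, density, and orphans (degree 0 or 1)."""
    adj = undirected_adjacency(nodes, edges)
    n = len(adj)
    edge_count = sum(len(v) for v in adj.values()) // 2
    possible = n * (n - 1) // 2
    return {
        "connected_components": len(connected_components(adj)),
        "graph_density": edge_count / possible if possible else 0.0,
        "orphan_entities": sorted(k for k, v in adj.items() if len(v) <= 1),
    }
```

Betweenness centrality and Louvain community detection are considerably more involved, which is the main argument for the optional networkx dependency.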
+ +**Maps to:** INFRA-TASKS #16 (prerequisite) +**Location:** `markitect/prompts/dependencies/analysis.py` or new +`markitect/analysis/graph.py` +**Depends on:** Nothing — can start immediately +**Deliverable:** Graph analysis functions with tests + +### S1.5 — Structured evaluation output + +Define a standard format for evaluation results: YAML front-matter + +markdown body. Add utilities for: + +- Writing evaluation results (per-entity, per-pair, collection-level) +- Reading/parsing evaluation results back into dataclasses +- Appending timestamped snapshots to a history file +- Diffing two snapshots + +**Maps to:** INFRA-TASKS #11, #12 +**Location:** `markitect/prompts/quality/` or `markitect/analysis/` +**Depends on:** S1.1 +**Deliverable:** `EvaluationResult` model + read/write utilities with +tests + +### S1.6 — Batch LLM evaluation orchestrator + +A pipeline component that runs an evaluation prompt template against a +batch of entities (or entity pairs), collecting structured results. +Must handle: + +- Rate limiting and retry (reuse existing adapter logic) +- Progress reporting +- Incremental evaluation (skip entities whose content hasn't changed + since last eval) +- Result aggregation + +This is the mechanism by which infospace tooling delegates LLM work +to the platform. + +**Maps to:** INFRA-TASKS #9 (prerequisite) +**Location:** `markitect/prompts/execution/batch.py` +**Depends on:** S1.5 +**Deliverable:** `BatchEvaluator` class with tests + +### S1.7 — FCA computation + +Formal Concept Analysis: build a formal context (entity × attribute +matrix), compute the concept lattice, extract gap concepts. Either +implement a minimal FCA algorithm or integrate a library. 
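For a sense of scale, a minimal FCA can be written in a few dozen lines. The sketch below assumes the formal context is a mapping from entity slug to a set of attributes. It enumerates concept intents as intersections of object rows, and reports a gap as a domain/system pair whose extent is empty. All names are hypothetical, and the naive enumeration is meant for small contexts only.

```python
from itertools import product


def extent(context, attributes):
    """Objects that carry every attribute in `attributes`."""
    attrs = frozenset(attributes)
    return frozenset(obj for obj, has in context.items() if attrs <= has)


def intent(context, objects, all_attributes):
    """Attributes shared by every object in `objects`."""
    shared = frozenset(all_attributes)
    for obj in objects:
        shared &= context[obj]
    return shared


def formal_concepts(context):
    """All (extent, intent) pairs, ordered by intent size."""
    all_attributes = frozenset().union(*context.values()) if context else frozenset()
    intents = {all_attributes}
    for row in context.values():
        # Every intent is an intersection of object rows.
        intents |= {frozenset(row) & known for known in intents}
    return sorted(
        ((extent(context, i), i) for i in intents),
        key=lambda pair: (len(pair[1]), sorted(pair[1])),
    )


def gap_combinations(context, domains, systems):
    """Domain/system pairs that no entity covers."""
    return [
        (d, s) for d, s in product(domains, systems)
        if not extent(context, {d, s})
    ]
```

With an illustrative context in which no entity carries both `Distribution` and `S3`, `gap_combinations(context, ["Distribution"], ["S1", "S3"])` would report `("Distribution", "S3")` as a specific gap.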
+ +**Maps to:** INFRA-TASKS #15 (prerequisite) +**Location:** `markitect/analysis/fca.py` +**Depends on:** S1.1 +**Deliverable:** `FormalContext`, `ConceptLattice`, `find_gap_concepts()` +with tests + +### Summary: Stage 1 dependency graph + +``` +S1.1 Entity metadata parser ──┬── S1.2 Schema validator + ├── S1.5 Eval output format ── S1.6 Batch evaluator + └── S1.7 FCA computation + +S1.3 Embedding adapter ──────── (independent) +S1.4 Graph analysis ─────────── (independent) +``` + +S1.1, S1.3, and S1.4 can proceed in parallel. S1.6 (batch evaluator) is +the final piece needed before Stage 2 can begin. + +--- + +## Stage 2: Infospace Tooling + +The user-facing layer that provides documented primitives for working +with infospaces. Built on top of Stage 1 infrastructure and the existing +`markitect/spaces/` module. + +### S2.1 — Infospace model and configuration + +Define the `Infospace` as a first-class concept that extends the existing +`InformationSpace` with: + +- **Topic declaration**: name, domain, source material reference +- **Discipline bindings**: which external infospaces are applied as lenses +- **Schema registry**: which schemas govern entity structure +- **Competency questions**: what the infospace should be able to answer +- **Viability thresholds**: minimum acceptable metric scores +- **Evaluation state**: latest per-entity and collection scores + +Configuration format: an `infospace.yaml` (or section in existing config) +that declares all of the above.
+ +**Location:** new `markitect/infospace/` package +**Depends on:** S1.1, S1.5, existing `markitect/spaces/` +**Deliverable:** `InfospaceConfig`, `InfospaceState` models + loader + +### S2.2 — Infospace lifecycle commands + +CLI commands for the core lifecycle: + +```bash +# Initialise a new infospace +markitect infospace init --topic "Wealth of Nations" \ + --domain "Economics" \ + --discipline vsm-framework + +# Show infospace status (entity count, eval state, viability) +markitect infospace status + +# List entities with quality summary +markitect infospace entities [--sort-by score|domain|chapter] + +# Show viability dashboard +markitect infospace viability +``` + +These commands read the `infospace.yaml` config and present information +from the metadata index and evaluation results. + +**Location:** `markitect/infospace/cli.py` integrated into main CLI +**Depends on:** S2.1 +**Deliverable:** CLI commands with help text and tests + +### S2.3 — Per-entity evaluation primitives + +Prompt templates and CLI commands for evaluating individual entities: + +```bash +# Evaluate all entities +markitect infospace evaluate --provider openrouter + +# Evaluate entities from a specific chapter +markitect infospace evaluate --chapter book-1-chapter-05 --provider openrouter + +# Re-evaluate a single entity +markitect infospace evaluate --entity division-of-labour --provider openrouter +``` + +Uses the batch evaluator (S1.6) to run the evaluate-entity prompt +template (defined in the infospace's schema directory) against entities. +Writes structured results to `output/evaluations/`. 
+ +**Maps to:** INFRA-TASKS #8, #9 +**Location:** `markitect/infospace/evaluation.py` +**Depends on:** S1.6, S2.1 +**Deliverable:** Per-entity evaluation pipeline + CLI + prompt template + +### S2.4 — Collection-level checks + +CLI commands for each of the five collection concerns: + +```bash +# Run all collection checks +markitect infospace check --provider openrouter + +# Run specific checks +markitect infospace check redundancy --provider openrouter +markitect infospace check coverage --provider openrouter +markitect infospace check coherence --provider openrouter +markitect infospace check consistency --provider openrouter +markitect infospace check granularity --provider openrouter +``` + +Each check uses Stage 1 infrastructure (embeddings, graph analysis, FCA) +and delegates LLM judgment to the platform. Results written to +`output/metrics/` as per-concern reports + unified `metrics.yaml`. + +**Maps to:** INFRA-TASKS #14-19 +**Location:** `markitect/infospace/checks/` (one module per concern) +**Depends on:** S1.3, S1.4, S1.6, S1.7, S2.1 +**Deliverable:** Five check modules + unified orchestrator + CLI + +### S2.5 — Metrics history and viability tracking + +Track metrics over time. After each evaluation or check run, append a +timestamped snapshot to `metrics-history.yaml`. Provide commands to +review trends: + +```bash +# Show metrics history +markitect infospace history + +# Compare two snapshots +markitect infospace history diff 2026-02-18 2026-03-01 + +# Check viability against thresholds +markitect infospace viability +``` + +Viability is assessed by comparing current metrics to the thresholds +declared in `infospace.yaml`. A simple pass/fail per metric with the +actual value. 
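The pass/fail comparison could be as small as the sketch below. It assumes each threshold is a mapping with optional `min` and `max` keys and that current metrics arrive as a flat name-to-value mapping; both shapes and all names here are illustrative rather than a fixed format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Threshold:
    minimum: Optional[float] = None
    maximum: Optional[float] = None

    def passes(self, value: float) -> bool:
        """True when the value lies within whichever bounds are set."""
        if self.minimum is not None and value < self.minimum:
            return False
        if self.maximum is not None and value > self.maximum:
            return False
        return True


def assess_viability(metrics, thresholds):
    """Return one row per thresholded metric: (name, actual value, passed)."""
    rows = []
    for name, raw in thresholds.items():
        threshold = Threshold(minimum=raw.get("min"), maximum=raw.get("max"))
        value = metrics.get(name)
        # A metric that was never computed cannot be shown to pass.
        passed = value is not None and threshold.passes(value)
        rows.append((name, value, passed))
    return rows
```

Treating a missing metric as a failure is a deliberate choice in this sketch: an infospace that has not been evaluated on a thresholded metric should not be reported as viable on it.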
+ +**Maps to:** INFRA-TASKS #12 +**Location:** `markitect/infospace/history.py` +**Depends on:** S2.4, S1.5 +**Deliverable:** History tracking + viability assessment + CLI + +### S2.6 — Infospace composition model + +The mechanism by which one infospace is applied as a discipline to +another. Builds on `markitect/spaces/composability/`: + +- **Discipline binding**: declare that infospace A uses infospace B as a + discipline. B's entities become available as mapping targets. +- **Cross-infospace references**: entity in A maps to concept in B using + the same mapping schema and evaluation pipeline. +- **Discipline viability requirement**: B must be viable (meets its own + thresholds) before it can be used as a discipline for A. +- **Cascading evaluation**: when B's entities change, A's mappings that + reference them are flagged for re-evaluation. + +```bash +# Bind a discipline to the current infospace +markitect infospace bind-discipline ./path/to/vsm-infospace + +# List bound disciplines and their viability +markitect infospace disciplines + +# Check for stale mappings after discipline update +markitect infospace check stale-mappings +``` + +**Location:** `markitect/infospace/composition.py` +**Depends on:** S2.1, existing `markitect/spaces/composability/` +**Deliverable:** Composition model + CLI + documentation + +### S2.7 — Documentation: Infospace Primitives Reference + +A reference document explaining all primitives, their purpose, and how +they compose. This is the user-facing documentation for the infospace +tooling layer — the equivalent of a framework guide. 
+
+**Location:** `docs/infospace-primitives.md` or in-CLI help
+**Depends on:** S2.1-S2.6
+**Deliverable:** Reference documentation
+
+### Summary: Stage 2 dependency graph
+
+```
+S2.1 Model & config ──┬── S2.2 Lifecycle CLI
+                      ├── S2.3 Per-entity evaluation
+                      ├── S2.4 Collection checks ── S2.5 History & viability
+                      └── S2.6 Composition model
+
+S2.7 Documentation (depends on all above)
+```
+
+---
+
+## Stage 3: Example Revision
+
+Revisit the Wealth of Nations / VSM example using the new tooling.
+The example becomes both a tutorial and a validation of the tooling.
+
+### S3.1 — Migrate example to infospace configuration
+
+Replace the ad-hoc `process_chapters.py` setup with a declarative
+`infospace.yaml`:
+
+```yaml
+topic:
+  name: "The Wealth of Nations"
+  domain: "Classical Economics"
+  sources: artifacts/sources/
+
+disciplines:
+  - name: "Viable System Model"
+    path: artifacts/vsm-reference/
+
+schemas:
+  entity: schemas/economic-entity-schema-v1.0.md
+  mapping: schemas/vsm-mapping-schema-v1.0.md
+  analysis: schemas/chapter-analysis-schema-v1.0.md
+
+competency_questions: schemas/competency-questions.md
+
+viability:
+  redundancy_ratio: { max: 0.05 }
+  coverage_ratio: { min: 0.60 }
+  coherence_components: { max: 1 }
+  consistency_cycles: { max: 0 }
+  granularity_entropy: { min: 1.0 }
+  per_entity_mean: { min: 3.5 }
+
+pipeline:
+  stages:
+    - template: extract-entities
+      spaces: [sources, guidelines, vsm-reference, entities]
+    - template: map-to-vsm
+      spaces: [entities, vsm-reference, guidelines]
+    - template: synthesize-analysis
+      spaces: [sources, entities, mappings, vsm-reference]
+  post_batch:
+    - template: assess-metrics
+      spaces: [analyses, vsm-reference]
+```
+
+**Depends on:** S2.1
+**Deliverable:** `infospace.yaml` + migration of `process_chapters.py` to
+use infospace tooling APIs
+
+### S3.2 — Clean per-chapter git history
+
+Re-run all processed chapters (and remaining ones) with per-chapter
+commits on a clean branch, then replace the current tangled history.
+
+**Maps to:** INFRA-TASKS #4, #7
+**Depends on:** S3.1
+**Deliverable:** Clean branch with one commit per chapter
+
+### S3.3 — Full evaluation run
+
+Run all per-entity evaluations and collection checks on the completed
+infospace. Establish baseline metrics. Demonstrate the viability
+dashboard.
+
+**Maps to:** INFRA-TASKS #6
+**Depends on:** S2.3, S2.4, S2.5, S3.2
+**Deliverable:** Complete evaluation results + viability report
+
+### S3.4 — Rewrite tutorial
+
+Update `TUTORIAL.md` to use infospace tooling commands instead of
+raw `process_chapters.py` invocations. The tutorial should walk
+through:
+
+1. Initialising an infospace (`markitect infospace init`)
+2. Defining schemas and competency questions
+3. Processing chapters (pipeline execution)
+4. Evaluating entities (`markitect infospace evaluate`)
+5. Running collection checks (`markitect infospace check`)
+6. Reviewing viability (`markitect infospace viability`)
+7. Iterating: refining guidelines, re-processing, re-evaluating
+8. Using the infospace as a discipline for a new project
+
+**Depends on:** S3.1-S3.3
+**Deliverable:** Revised `TUTORIAL.md`
+
+### S3.5 — Demonstrate composition
+
+Create a minimal second infospace (e.g. a modern supply chain case
+study or a different economic text) that binds the Wealth of Nations
+infospace as a discipline. Demonstrates the composition model from S2.6.
+
+**Depends on:** S2.6, S3.3
+**Deliverable:** Second example infospace + composition tutorial section
+
+---
+
+## Task Mapping
+
+Cross-reference between INFRA-TASKS numbers and roadmap stages:
+
+| INFRA-TASK | Description | Stage |
+|------------|-------------|-------|
+| 1-3 | Infra fixes (resolved) | — |
+| 4 | Per-chapter git history | S3.2 |
+| 5 | Prompt file side-effects | S1.6 (batch eval avoids this) |
+| 6 | Stale metrics | S3.3 |
+| 7 | Remaining 28 chapters | S3.2 |
+| 8 | Per-concept quality metrics in schema | S2.3 |
+| 9 | Evaluate-entity prompt template | S2.3 |
+| 10 | Deterministic schema compliance | S1.2 |
+| 11 | Structured metrics output | S1.5 |
+| 12 | Metrics-over-time tracking | S2.5 |
+| 13 | Entity metadata index | S1.1 |
+| 14 | Redundancy detection (C1) | S2.4 |
+| 15 | Coverage completeness (C2) | S2.4 |
+| 16 | Structural coherence (C3) | S2.4 |
+| 17 | Definitional consistency (C4) | S2.4 |
+| 18 | Granularity balance (C5) | S2.4 |
+| 19 | Unified collection evaluation | S2.4 |
+
+---
+
+## Implementation Order
+
+Recommended sequence, accounting for dependencies and value delivery:
+
+**Phase A — Foundation (Stage 1, parallelisable)**
+1. S1.1 Entity metadata parser
+2. S1.3 Embedding adapter
+3. S1.4 Graph analysis utilities
+
+**Phase B — Validation & Output (Stage 1)**
+4. S1.2 Schema compliance validator (needs S1.1)
+5. S1.5 Structured evaluation output (needs S1.1)
+6. S1.7 FCA computation (needs S1.1)
+
+**Phase C — Orchestration (Stage 1 → Stage 2 bridge)**
+7. S1.6 Batch LLM evaluation orchestrator (needs S1.5)
+
+**Phase D — Infospace Core (Stage 2)**
+8. S2.1 Infospace model and configuration
+9. S2.2 Lifecycle commands
+10. S2.3 Per-entity evaluation primitives (needs S1.6, S2.1)
+
+**Phase E — Collection Intelligence (Stage 2)**
+11. S2.4 Collection-level checks (needs S1.3, S1.4, S1.7, S2.1)
+12. S2.5 Metrics history and viability tracking
+
+**Phase F — Composition (Stage 2)**
+13. S2.6 Infospace composition model
+14. S2.7 Documentation
+
+**Phase G — Example (Stage 3)**
+15. S3.1 Migrate example to infospace config
+16. S3.2 Clean per-chapter history
+17. S3.3 Full evaluation run
+18. S3.4 Rewrite tutorial
+19. S3.5 Demonstrate composition
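
The recommended sequence can be checked mechanically against the stated dependencies. The sketch below is illustrative only and is not part of the planned tooling: the names `DEPENDS_ON`, `RECOMMENDED_ORDER`, `order_violations`, and `has_cycle` are hypothetical, and the dependency table is transcribed from the "Depends on" fields and "(needs ...)" notes visible in this section. Tasks whose dependencies are not stated here (S1.1, S1.3, S1.4, S2.1) are assumed to have no prerequisites.

```python
from graphlib import CycleError, TopologicalSorter

# Task -> prerequisites, transcribed from this plan.
# Assumption: S1.1, S1.3, S1.4 and S2.1 have no prerequisites, because
# none are stated in this section.
DEPENDS_ON = {
    "S1.1": [],
    "S1.3": [],
    "S1.4": [],
    "S1.2": ["S1.1"],
    "S1.5": ["S1.1"],
    "S1.7": ["S1.1"],
    "S1.6": ["S1.5"],
    "S2.1": [],
    "S2.2": ["S2.1"],
    "S2.3": ["S1.6", "S2.1"],
    "S2.4": ["S1.3", "S1.4", "S1.7", "S2.1"],
    "S2.5": ["S2.4", "S1.5"],
    "S2.6": ["S2.1"],
    "S2.7": ["S2.1", "S2.2", "S2.3", "S2.4", "S2.5", "S2.6"],
    "S3.1": ["S2.1"],
    "S3.2": ["S3.1"],
    "S3.3": ["S2.3", "S2.4", "S2.5", "S3.2"],
    "S3.4": ["S3.1", "S3.2", "S3.3"],
    "S3.5": ["S2.6", "S3.3"],
}

# Steps 1-19 of the Implementation Order list, in sequence.
RECOMMENDED_ORDER = [
    "S1.1", "S1.3", "S1.4",
    "S1.2", "S1.5", "S1.7",
    "S1.6",
    "S2.1", "S2.2", "S2.3",
    "S2.4", "S2.5",
    "S2.6", "S2.7",
    "S3.1", "S3.2", "S3.3", "S3.4", "S3.5",
]


def order_violations(order, depends_on):
    """Return (task, prerequisite) pairs where the prerequisite is not
    scheduled strictly before the task (or is missing from the order)."""
    position = {task: index for index, task in enumerate(order)}
    violations = []
    for task in order:
        for prerequisite in depends_on.get(task, []):
            if position.get(prerequisite, len(order)) >= position[task]:
                violations.append((task, prerequisite))
    return violations


def has_cycle(depends_on):
    """Return True if the dependency table contains a circular dependency."""
    try:
        TopologicalSorter(depends_on).prepare()
    except CycleError:
        return True
    return False


if __name__ == "__main__":
    print("cycle:", has_cycle(DEPENDS_ON))
    print("violations:", order_violations(RECOMMENDED_ORDER, DEPENDS_ON))
```

A check of this kind would need to be re-run whenever a "Depends on" field changes, since the table above is a manual transcription rather than something parsed from the plan.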