docs: metrics methodology, collection-level tasks, and infospace tooling roadmap

Add METRICS-METHODOLOGY.md documenting the theoretical frameworks
(SEQUAL, OntoClean, OOPS!, OntoQA, FCA, DSL principles) adapted for
two-layer evaluation (LLM-Eval + deterministic aggregation) across
five collection concerns: redundancy, coverage, coherence, consistency,
and granularity balance.

Extend INFRA-TASKS.md with assignment assessment (tasks 4-7),
per-concept metrics (tasks 8-12), and collection-level metrics
(tasks 13-19).

Add roadmap/infospace-tooling/PLAN.md defining terminology (infospace,
topic, discipline, entity, evaluation, viability) and a three-stage
implementation plan: Stage 1 platform additions, Stage 2 infospace
tooling layer, Stage 3 example revision.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-18 23:53:21 +01:00
parent 2f0989f9bf
commit 4ce856d4d0
3 changed files with 1632 additions and 0 deletions


@@ -37,3 +37,513 @@ no automatic parsing for this format, requiring manual macro construction.
**Fix applied:** Added `SHORTHAND_PATTERN` to `MacroParser` that recognises
`@{target}` and maps it to `MacroKind.REQUIRED`. Updated `has_macros()`,
`count_macros()`, and `find_macro_positions()` accordingly.
---
## Assignment Assessment (18 Feb 2026)
How the example measures against the objectives stated in `README.md`:
| # | Objective | Status | Notes |
|---|-----------|--------|-------|
| 1 | Capture knowledge from Wealth of Nations | **Partial** | 7 of 35 chapters processed (Book I, ch. 1-7). 85 canonical entities extracted. |
| 2 | Transform to VSM concepts/entities | **Done (for processed chapters)** | Entities mapped to S1-S5 with strength ratings. |
| 3 | Consistent and complete | **Not yet** | Only 20% of chapters done. Metrics report exists but covers limited scope. |
| 4 | Schemas as scaffolding | **Done** | Four schemas defined and used across all stages. |
| 5 | Prompt dependency resolution | **Done** | `@{macro}` templates resolved via MultiSpaceResolutionStrategy. |
| 6 | Incremental chapter injection | **Done** | Pipeline processes one chapter at a time; `@{existing_entities}` prevents duplication. |
| 7 | Keep changes as git history | **Not done** | See task 4 below. |
| 8 | Metrics for completeness/consistency | **Partial** | Template and report exist but only cover 4 chapters (report predates ch. 5-7). |
| 9 | No infrastructure changes during experiment | **Violated** | Three infra fixes were required (tasks 1-3 above). Documented as intended. |
| 10 | Generate task list for infra issues | **Done** | This file. |
## 4. Infospace has no per-chapter git history — OPEN
**Objective:** README states "The information space should utilize the option
of keeping changes as git history."
**Issue:** The 7 processed chapters were committed in mixed batches alongside
infrastructure changes (LLM adapters, entity refactoring, archive policy).
Chapters 1-2 are bundled into `fecc2fd` with the entire LLM module.
Chapters 5-7 share a single commit (`41773f1`) with the OpenAI adapter and
archive policy. There is no commit where you can `git diff` to see exactly
what one chapter contributed to the infospace.
**Impact:** Cannot use `git log`, `git diff`, or `git bisect` to trace how
the infospace grew chapter by chapter — the core promise of "with history."
**Suggested fix:** Re-run the 7 processed chapters (and remaining 28) using
`process_chapters.py` without `--no-commit`, on a clean branch or after
squashing the current output into a baseline commit. Each chapter gets its
own commit via `_git_commit_chapter()`.
## 5. Prompt files are regenerated as a side-effect of DB rebuild — OPEN
**Issue:** Running `--all --no-commit` to regenerate `infospace.db` also
overwrites `*-prompt.md` files in the output directories because each
pipeline stage unconditionally writes the compiled prompt before checking
whether output already exists. The `@{existing_entities}` macro content
shifts as earlier chapters are loaded, so prompt files for already-processed
chapters change on every full run.
**Impact:** A DB regeneration dirties the working tree with prompt file
changes, even though no actual outputs changed. Users must `git checkout`
the prompt files after regeneration.
**Suggested fix:** Skip writing prompt files when the corresponding output
file already exists on disk, or add a `--rebuild-db-only` flag that
populates the database without touching the file system.
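A minimal sketch of the first option (skip the prompt write when the output already exists). The helper name `write_prompt_if_new` and its signature are hypothetical; nothing in `process_chapters.py` is assumed beyond the behaviour described above.

```python
from pathlib import Path


def write_prompt_if_new(prompt_path: Path, output_path: Path, prompt_text: str) -> bool:
    """Write the compiled prompt only when the stage output does not exist yet.

    Hypothetical helper: returns True if the prompt file was written,
    False if the write was skipped because the output is already on disk.
    """
    if output_path.exists():
        # Output already produced: leave the committed prompt file untouched
        # so a DB rebuild does not dirty the working tree.
        return False
    prompt_path.parent.mkdir(parents=True, exist_ok=True)
    prompt_path.write_text(prompt_text, encoding="utf-8")
    return True
```

Checking the output file rather than the prompt file keeps first-time runs unchanged while making full rebuilds side-effect free for already-processed chapters.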
## 6. Metrics report is stale — OPEN
**Issue:** The metrics report (`output/metrics/metrics-report.md`) was
generated after chapters 1-4. Chapters 5-7 have since been processed but
the report has not been refreshed.
**Impact:** The metrics do not reflect the current state of the infospace.
**Suggested fix:** Re-run `--metrics --provider <provider> --no-commit`
after every batch of new chapters. Consider making metrics assessment
automatic at the end of `--book` or `--all` runs.
## 7. Remaining 28 chapters not yet processed — OPEN
**Issue:** Only Book I chapters 1-7 have been processed. The remaining
28 chapters (the rest of Book I and Books II-V) are unprocessed.
**Impact:** The infospace is incomplete — VSM coverage is limited to S1,
S2, and partial S4. S3, S3*, S5, and many systemic concepts (algedonic
signals, recursion, variety) are expected to emerge from later books.
**Suggested fix:** Process remaining chapters in book-sized batches with
per-chapter commits, refreshing metrics after each book.
---
## Per-Concept Metrics (tasks 8-12)
The current metrics system is a single LLM-evaluated narrative report that
assesses the infospace as a whole. It produces no machine-readable output,
cannot be tracked over time, and conflates per-concept quality with
collection-level coherence.
The improvement splits metrics into two layers:
- **LLM-Eval**: A prompt template evaluates each concept individually
against quality criteria defined in the schema. The LLM returns structured
scores, not prose.
- **Deterministic aggregation**: `process_chapters.py` computes what it can
from files on disk (schema compliance, word counts, section presence,
coverage tallies) and aggregates LLM-eval scores into dashboard metrics.
Both layers persist results in structured form so they can be diffed,
tracked over time, and committed alongside the entities they evaluate.
## 8. Add per-concept quality metrics to entity schema — OPEN
**Issue:** The entity schema (`economic-entity-schema-v1.0.md`) defines
required sections and validation rules (section presence, word count range)
but no quality criteria. There is no definition of what makes a *good*
entity versus a merely *compliant* one.
**Suggested fix:** Add a `## Quality Metrics` section to the entity schema
defining evaluation dimensions with scoring rubrics:
- **Definition Precision** (1-5): Is the definition specific, non-circular,
and distinguishable from neighbouring concepts?
- **Source Grounding** (1-5): Is the entity grounded in a specific passage?
Does the citation exist and support the definition?
- **Domain Placement** (1-5): Is the economic domain assignment correct and
specific (not just "General Theory")?
- **VSM Relevance** (1-5): Does the entity connect meaningfully to at least
one VSM system, or is it too granular/abstract to map?
- **Explanatory Value** (1-5): Does this entity contribute to explaining
the economic system, or is it a restatement of another concept?
Similarly update the VSM mapping schema with:
- **Rationale Rigour** (1-5): Is the mapping justified with reference to
Beer's definitions, not just surface-level analogy?
- **Strength Calibration** (1-5): Is the declared strength (Strong/Moderate/
Weak) consistent with the rationale given?
These rubrics become the prompt instructions for task 9.
## 9. Create evaluate-entity prompt template — OPEN
**Depends on:** Task 8 (quality metrics in schema).
**Issue:** There is no mechanism to evaluate an existing entity after
extraction. Quality is only judged implicitly during the global metrics
assessment, which is too coarse to identify individual weak entities.
**Suggested fix:** Create `templates/evaluate-entity.md` — a prompt
template that:
1. Takes `@{entity_content}`, `@{source_chapter}`, `@{vsm_framework}`,
and `@{quality_rubric}` (from the schema's quality metrics section).
2. Asks the LLM to score each dimension (1-5) with a one-sentence
justification per score.
3. Outputs structured YAML front-matter (scores) followed by markdown
(justifications), e.g.:
```yaml
---
entity: division-of-labour
scores:
  definition_precision: 5
  source_grounding: 5
  domain_placement: 4
  vsm_relevance: 5
  explanatory_value: 5
overall: 4.8
flags: []
---
```
Add a pipeline stage: `--evaluate` runs this template against every
canonical entity and writes results to `output/evaluations/<slug>-eval.md`.
A `--evaluate --chapter <id>` variant evaluates only entities introduced
by that chapter.
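The `overall` and `flags` fields can be derived deterministically from the five dimension scores instead of being trusted from the LLM. A sketch, assuming `overall` is the arithmetic mean (consistent with the 4.8 in the example above) and that a dimension is flagged when it scores below 3; the function name and the threshold are illustrative assumptions, not part of the schema.

```python
from statistics import mean

# Dimension names taken from the example front-matter in this task.
DIMENSIONS = (
    "definition_precision",
    "source_grounding",
    "domain_placement",
    "vsm_relevance",
    "explanatory_value",
)


def summarise_scores(scores: dict[str, int], flag_below: int = 3) -> dict:
    """Validate the 1-5 scores, then compute the overall mean and the flags."""
    for name in DIMENSIONS:
        if name not in scores:
            raise ValueError(f"missing score: {name}")
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"score out of range (1-5): {name}")
    return {
        # Assumption: overall is the unweighted mean, rounded to one decimal.
        "overall": round(mean(scores[name] for name in DIMENSIONS), 1),
        # Assumption: any dimension below `flag_below` is flagged for review.
        "flags": [name for name in DIMENSIONS if scores[name] < flag_below],
    }
```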
## 10. Add deterministic schema compliance checker — OPEN
**Issue:** Schema compliance is currently LLM-evaluated ("100%" in the
metrics report) but the validation rules in the schemas are mechanical:
section presence, word count ranges, heading format. These should be
checked programmatically, not by an LLM.
**Suggested fix:** Add a `validate_entity(path) -> ValidationResult`
function to `process_chapters.py` (or a new `validate.py` module) that:
- Parses the markdown to extract H2 section headings
- Checks required sections are present (Definition, Source Chapter,
Context, Economic Domain)
- Counts words in the Definition section (must be 20-150)
- Checks H1 heading exists and is not a slug (e.g. `effectual-demand`
in chapter 7 has `# effectual-demand` instead of `# Effectual Demand`)
- Validates Source Chapter cites a specific book/chapter
- For mapping files: checks Mapping Strength is one of the enum values
Expose as `--validate` CLI flag. Output a structured report:
```
Validation: 85 entities, 3 warnings
  effectual-demand.md: H1 is slug format, not title case
  porter.md: Definition is 18 words (minimum 20)
  ...
```
This is fully deterministic — no LLM calls needed.
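A rough sketch of the core checks, operating on markdown text rather than a path so it stays self-contained. The `ValidationResult` shape, the helper names, and the slug regex are assumptions for illustration; the Source Chapter and mapping-file checks are omitted.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

REQUIRED_SECTIONS = ("Definition", "Source Chapter", "Context", "Economic Domain")
# Assumed slug shape: lowercase words joined by hyphens, e.g. "effectual-demand".
SLUG_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


@dataclass
class ValidationResult:  # hypothetical shape; the task only names the type
    warnings: list[str] = field(default_factory=list)

    @property
    def ok(self) -> bool:
        return not self.warnings


def split_sections(markdown: str) -> tuple[Optional[str], dict[str, str]]:
    """Return the H1 title (or None) and a mapping of H2 heading to body text."""
    title = None
    sections: dict[str, list[str]] = {}
    current = None
    for line in markdown.splitlines():
        if line.startswith("# ") and title is None:
            title = line[2:].strip()
        elif line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return title, {name: "\n".join(body).strip() for name, body in sections.items()}


def validate_entity_text(markdown: str) -> ValidationResult:
    """Check H1 format, required sections, and Definition word count."""
    result = ValidationResult()
    title, sections = split_sections(markdown)
    if title is None:
        result.warnings.append("H1 heading is missing")
    elif SLUG_RE.fullmatch(title):
        result.warnings.append("H1 is slug format, not title case")
    for name in REQUIRED_SECTIONS:
        if name not in sections:
            result.warnings.append(f"required section missing: {name}")
    if "Definition" in sections:
        words = len(sections["Definition"].split())
        if not 20 <= words <= 150:
            result.warnings.append(f"Definition is {words} words (allowed 20-150)")
    return result
```

A `validate_entity(path)` wrapper would only need to read the file and pass its text to `validate_entity_text`.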
## 11. Structured metrics output format — OPEN
**Depends on:** Tasks 9 and 10.
**Issue:** The metrics report is a markdown narrative. Values cannot be
parsed programmatically, diffed meaningfully, or plotted over time.
**Suggested fix:** Alongside the human-readable `metrics-report.md`,
emit a machine-readable `metrics.yaml` (or `.json`) containing:
```yaml
timestamp: "2026-02-18T12:00:00Z"
chapters_processed: 7
chapters_total: 35
entities_total: 85
entities_archived: 0
vsm_coverage:
  S1: 28
  S2: 12
  S3: 8
  S3_star: 0
  S4: 5
  S5: 0
  recursion: 1
  variety: 0
mapping_strength:
  strong: 64
  moderate: 18
  weak: 3
validation:
  schema_compliant: 82
  warnings: 3
evaluation:                    # from LLM-eval (task 9)
  mean_overall: 4.2
  min_overall: 2.8
  flagged_entities: ["porter", "country-workman"]
```
The `--metrics` command writes both files. The YAML file is committed
to git so `git diff` shows exactly how metrics changed between runs.
## 12. Metrics-over-time tracking — OPEN
**Depends on:** Task 11 (structured output).
**Issue:** There is one metrics snapshot that gets overwritten. No history
of how metrics evolved as chapters were added.
**Suggested fix:** Append each metrics snapshot to a cumulative log file
`output/metrics/metrics-history.yaml` (list of timestamped entries). This
is committed to git alongside the current snapshot. The pipeline can
optionally render a simple text-based progress summary:
```
Metrics history (5 snapshots):
  2026-02-10  ch 1/35  13 entities  41.7% VSM coverage
  2026-02-11  ch 4/35  38 entities  50.0% VSM coverage
  2026-02-11  ch 7/35  85 entities  58.3% VSM coverage
  ...
```
This provides the "metrics that improve over time" feedback loop the
README envisions: process chapters → evaluate → see coverage grow (or
flag regressions when a re-extraction reduces quality scores).
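Regression flagging over the history can be sketched as a comparison of consecutive snapshots. This assumes the history has already been loaded into a list of dicts (oldest first) whose keys follow the `metrics.yaml` example from task 11, flattened for brevity; `find_regressions` is a hypothetical helper name.

```python
def find_regressions(history: list[dict], key: str = "mean_overall") -> list[tuple[str, float, float]]:
    """Return (timestamp, previous, current) for each snapshot where `key` dropped.

    `history` is assumed to be ordered oldest to newest.
    """
    regressions = []
    for prev, curr in zip(history, history[1:]):
        if curr[key] < prev[key]:
            regressions.append((curr["timestamp"], prev[key], curr[key]))
    return regressions
```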
---
## Collection-Level Metrics (tasks 13-19)
These tasks implement the five collection-level concerns described in
`METRICS-METHODOLOGY.md`. They share underlying infrastructure (entity
metadata index, definition embeddings, relationship graph) that should
be built once per evaluation run.
See the methodology document for theoretical grounding, framework
references, and the full metric definitions per concern.
## 13. Entity metadata index — deterministic parsing layer — OPEN
**Depends on:** Task 10 (schema compliance checker shares parsing logic).
**Issue:** Several collection-level metrics (coverage matrix, FCA context,
granularity distribution) require structured metadata extracted from entity
files: H1 title, economic domain, VSM system(s), source chapter, section
presence, word counts. Currently this information exists only as prose
inside markdown files.
**Suggested fix:** Add a `parse_entity_metadata(path) -> EntityMeta`
function that extracts from each entity file:
```python
from dataclasses import dataclass


@dataclass
class EntityMeta:
    slug: str
    title: str                       # from H1
    domain: str                      # from Economic Domain section
    source_chapter: str              # from Source Chapter section
    definition_words: int            # word count of Definition section
    has_original_wording: bool       # optional section present?
    has_modern_interpretation: bool  # optional section present?
    vsm_systems: list[str]           # from mapping file if exists
    mapping_strengths: list[str]     # from mapping file if exists
```
Build an index of all entities at the start of each evaluation run.
This index is the input for tasks 14, 15, 16, and 18. Expose as
`--index` CLI flag for inspection.
## 14. Redundancy detection (Concern C1) — OPEN
**Depends on:** Task 13 (metadata index).
**Methodology:** OOPS! P2 (synonymous classes) + embedding similarity +
LLM pairwise judgment. See METRICS-METHODOLOGY.md §4 C1.
**Issue:** Entities with different slugs but overlapping meanings (e.g.
`natural-rate` / `ordinary-or-average-rate`) survive extraction because
dedup only checks slug collisions. There is no semantic overlap detection.
**Suggested fix:** Implement in three stages:
1. **Embed** — Compute vector embeddings of all entity definitions using
an embedding API (OpenRouter, OpenAI, or a local sentence-transformer).
Cache embeddings in `output/metrics/embeddings.json` keyed by
`{slug: content_digest}` so unchanged entities skip re-embedding.
2. **Similarity matrix** — Compute NxN cosine similarity. Write the full
matrix to `output/metrics/similarity-matrix.json`. Flag all pairs with
cosine > 0.80 as candidates.
3. **LLM pairwise judgment** — For each candidate pair, run a prompt:
"Given these two entity definitions, are they (a) the same concept and
should be merged, (b) genuinely distinct, or (c) partially overlapping
and should be clarified?" Write results to
`output/metrics/redundancy-report.md` + YAML.
**Metrics produced:**
- `high_similarity_pairs`: count and list
- `confirmed_synonyms`: count (LLM-confirmed same concept)
- `redundancy_ratio`: `confirmed_synonyms / total_entities`
- `intensional_conciseness`: `1 - redundancy_ratio`
**CLI:** `--check-redundancy --provider <provider>`
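Stage 2 and the two ratio metrics are deterministic and can be sketched with the standard library alone. Embeddings are assumed to be already computed and passed in as plain lists of floats; the function names are illustrative.

```python
import math
from itertools import combinations


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def candidate_pairs(embeddings: dict[str, list[float]], threshold: float = 0.80):
    """Return (slug_a, slug_b, similarity) for every pair above the threshold."""
    pairs = []
    for (a, va), (b, vb) in combinations(sorted(embeddings.items()), 2):
        similarity = cosine(va, vb)
        if similarity > threshold:
            pairs.append((a, b, similarity))
    return pairs


def redundancy_metrics(confirmed_synonyms: int, total_entities: int) -> dict[str, float]:
    """redundancy_ratio and intensional_conciseness as defined above."""
    ratio = confirmed_synonyms / total_entities
    return {"redundancy_ratio": ratio, "intensional_conciseness": 1 - ratio}
```

Only the pairs returned by `candidate_pairs` go on to the LLM pairwise judgment, which keeps the number of LLM calls well below the full N×N comparison.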
## 15. Coverage completeness (Concern C2) — OPEN
**Depends on:** Task 13 (metadata index).
**Methodology:** SEQUAL completeness + FCA gap analysis + DSL competency
questions. See METRICS-METHODOLOGY.md §4 C2.
**Issue:** Coverage is currently assessed by the LLM in a single narrative
pass. There is no structured view of which domain × VSM cells are
populated, and no way to test whether the entity set can answer specific
questions about the economic system.
**Suggested fix:** Implement in three stages:
1. **Domain × VSM matrix** — From the metadata index, count entities per
{economic_domain, vsm_system} cell. Render as a table. Identify empty
cells as specific, actionable gaps. Compute:
- `coverage_ratio = populated_cells / total_cells`
- `vsm_balance_entropy = -Σ(pᵢ log pᵢ)` across VSM systems
2. **FCA lattice** — Construct a formal context with objects = entities,
attributes = {domain, vsm_system, source_book, abstraction_level}.
Compute the concept lattice (Python `concepts` library). Extract
attribute combinations with no corresponding entity — these are
**structural coverage gaps** not visible in the simple matrix.
3. **Competency questions** — Define a set of 15-20 canonical questions
the infospace should answer (stored in
`schemas/competency-questions.md`). Example questions:
- "How does the division of labour relate to market extent?"
- "What mechanisms regulate wages toward their natural rate?"
- "How do monopolies distort the viable system?"
LLM-Eval tests whether current entities suffice to answer each.
Unanswerable questions identify specific completeness gaps.
**Metrics produced:**
- `domain_vsm_matrix`: cell counts
- `coverage_ratio`: scalar
- `vsm_balance_entropy`: scalar
- `empty_cells`: list of {domain, vsm_system} gaps
- `fca_gap_concepts`: attribute combos with no entity
- `competency_coverage`: fraction of questions answerable
**CLI:** `--check-coverage --provider <provider>`
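The stage 1 matrix metrics can be sketched as follows. The input is assumed to be a list of `(economic_domain, vsm_system)` pairs, one per mapping, derived from the metadata index. The logarithm base is not fixed by the formula above; base 2 is an assumption here.

```python
import math
from collections import Counter


def coverage_metrics(mappings: list[tuple[str, str]], domains: list[str], vsm_systems: list[str]) -> dict:
    """Count entities per (domain, vsm_system) cell and derive coverage metrics."""
    cells = Counter(mappings)
    all_cells = [(d, s) for d in domains for s in vsm_systems]
    empty_cells = [cell for cell in all_cells if cells[cell] == 0]
    # Distribution of mappings across VSM systems, used for the balance entropy.
    per_system = Counter(system for _, system in mappings)
    total = sum(per_system.values())
    entropy = -sum((n / total) * math.log2(n / total) for n in per_system.values())
    return {
        "domain_vsm_matrix": dict(cells),
        "coverage_ratio": (len(all_cells) - len(empty_cells)) / len(all_cells),
        "vsm_balance_entropy": entropy,
        "empty_cells": empty_cells,
    }
```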
## 16. Structural coherence (Concern C3) — OPEN
**Depends on:** Task 13 (metadata index).
**Methodology:** OntoQA relationship richness + graph connectivity +
community detection. See METRICS-METHODOLOGY.md §4 C3.
**Issue:** It is unknown whether the 85 entities form a connected
explanatory web or a fragmented collection. No relationship graph exists
between entities.
**Suggested fix:** Implement in three stages:
1. **Explicit cross-references** — Scan each entity's definition for
mentions of other entity slugs or titles (normalised string matching).
This is deterministic and catches direct references.
2. **LLM-inferred edges** — For entity pairs not caught by string
matching but in the same domain or VSM system, LLM-Eval: "Does A's
definition conceptually depend on or explain B, or vice versa?" Run
in batches. Write the combined graph to
`output/metrics/relationship-graph.json` (adjacency list).
3. **Graph analysis** — Using networkx or equivalent:
- Connected components (target: 1)
- Graph density, average degree
- Betweenness centrality → identify bridge concepts
- Louvain community detection → compare to declared domains
- OntoQA Relationship Richness
- Cohesion per domain, coupling across domains
- Orphan entities (degree 0 or 1)
**Metrics produced:**
- `connected_components`: count (target: 1)
- `graph_density`: scalar
- `avg_degree`: scalar
- `relationship_richness`: OntoQA RR
- `modularity`: Louvain score
- `bridge_concepts`: list (high betweenness centrality)
- `orphan_entities`: list (degree ≤ 1)
- `cohesion_by_domain` / `coupling_across_domains`: scalars
**CLI:** `--check-coherence --provider <provider>`
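The simpler graph metrics do not strictly need a graph library. The sketch below uses only the standard library and treats the relationship graph as undirected; centrality, Louvain communities, and the OntoQA metrics are left out and would be easier with networkx as suggested above. Function names are illustrative.

```python
from collections import deque


def build_adjacency(nodes: list[str], edges: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Undirected adjacency sets; self-references are ignored."""
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        if a != b:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def connected_components(adjacency: dict[str, set[str]]) -> list[set[str]]:
    """Breadth-first search from every unvisited node."""
    seen: set[str] = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


def graph_summary(adjacency: dict[str, set[str]]) -> dict:
    n = len(adjacency)
    m = sum(len(neighbours) for neighbours in adjacency.values()) // 2
    return {
        "connected_components": len(connected_components(adjacency)),
        "graph_density": 2 * m / (n * (n - 1)) if n > 1 else 0.0,
        "avg_degree": 2 * m / n if n else 0.0,
        # Orphans as defined above: degree 0 or 1.
        "orphan_entities": sorted(k for k, v in adjacency.items() if len(v) <= 1),
    }
```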
## 17. Definitional consistency (Concern C4) — OPEN
**Depends on:** Task 16 (relationship graph — the definitional dependency
graph is a directed variant of the same structure).
**Methodology:** OntoClean metaproperties + OOPS! P24 (circular
definitions) + SEQUAL validity. See METRICS-METHODOLOGY.md §4 C4.
**Issue:** No mechanism to detect circular definitions, contradictions
between related entities, or terms used in definitions that should be
entities but aren't.
**Suggested fix:** Implement in four stages:
1. **Definitional dependency graph** — Directed version of the
relationship graph: edge A→B means A's definition uses B's concept.
Reuse cross-reference extraction from task 16.
2. **Cycle detection** — Find all cycles of length ≤ 3 in the directed
graph. Short cycles are problematic (A defines B, B defines A).
Compute `grounding_ratio`: fraction of entities traceable to terms
outside the entity set without encountering a cycle.
3. **Undefined dependencies** — Extract terms from definitions that match
entity-name patterns (capitalised noun phrases, kebab-case slugs) but
have no corresponding entity file. These are concepts the infospace
implicitly relies on but hasn't defined.
4. **LLM consistency checks** — For directly-connected entity pairs,
LLM-Eval: "Do these definitions contradict each other?" For entities
with Smith's Original Wording, LLM-Eval: "Does the definition
accurately represent the cited passage?"
**Metrics produced:**
- `circular_definitions`: count and list of cycles (length ≤ 3)
- `grounding_ratio`: fraction of entities reaching primitives
- `undefined_dependencies`: list of missing terms
- `contradiction_candidates`: LLM-flagged pairs
- `source_fidelity_score`: fraction passing source check
**CLI:** `--check-consistency --provider <provider>`
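Stage 2 can be sketched on a plain dict mapping each entity slug to the set of slugs its definition uses. The reading of `grounding_ratio` below is one possible interpretation (an entity is grounded when every dependency inside the entity set is itself grounded, so entities on or upstream of a cycle are not); it is an assumption, not a definition taken from the methodology document.

```python
def short_cycles(graph: dict[str, set[str]], max_len: int = 3) -> list[tuple[str, ...]]:
    """Find simple directed cycles with at most `max_len` nodes.

    Each cycle is reported once, starting at its lexicographically smallest node.
    """
    found: set[tuple[str, ...]] = set()

    def walk(start: str, node: str, path: list[str]) -> None:
        for nxt in graph.get(node, ()):
            if nxt == start:
                found.add(tuple(path))
            elif nxt > start and nxt not in path and len(path) < max_len:
                walk(start, nxt, path + [nxt])

    for start in graph:
        walk(start, start, [start])
    return sorted(found)


def grounding_ratio(graph: dict[str, set[str]]) -> float:
    """Fraction of entities whose dependency closure contains no cycle."""
    nodes = set(graph) | {dep for deps in graph.values() for dep in deps}
    grounded: set[str] = set()
    changed = True
    while changed:
        changed = False
        for node in nodes - grounded:
            # Grounded once every in-set dependency is grounded (or there are none).
            if graph.get(node, set()) <= grounded:
                grounded.add(node)
                changed = True
    return len(grounded) / len(nodes) if nodes else 1.0
```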
## 18. Granularity balance (Concern C5) — OPEN
**Depends on:** Task 13 (metadata index).
**Methodology:** Keet granularity theory + OntoClean rigidity +
DSL laconicity. See METRICS-METHODOLOGY.md §4 C5.
**Issue:** Entities range from broad sectors (`agriculture`) to specific
market roles (`effectual-demanders`) to abstract principles
(`division-of-labour`). It is unclear whether this range is appropriate
or whether some entities are too specific/general relative to their peers.
**Suggested fix:** Implement in three stages:
1. **LLM classification** — For each entity, LLM-Eval assigns:
- Abstraction level: `theory` / `mechanism` / `observation`
- Scope score: 1-5 (very specific → very general)
- Indispensability: 1-5 ("if removed, how much explanatory power lost?")
Write to `output/evaluations/<slug>-classification.yaml`.
2. **Distribution analysis** — Deterministic:
- Count per abstraction level; compute entropy
- Per-domain scope variance (flag domains with high variance)
- Level × domain matrix (from FCA context in task 15)
- Outlier detection: entities > 1.5σ from their domain's mean scope
3. **Merge/split recommendations** — For outlier entities, LLM-Eval:
"Should this entity be merged into a broader concept, split into
sub-concepts, or is its current granularity justified?" For entities
with indispensability ≤ 2: "Could another entity serve this purpose?"
**Metrics produced:**
- `abstraction_distribution`: {theory: n, mechanism: n, observation: n}
- `abstraction_entropy`: scalar (higher = more balanced)
- `scope_variance_by_domain`: per-domain scalar
- `dispensable_entities`: list (indispensability ≤ 2)
- `merge_candidates`: list of pairs
- `split_candidates`: list of entities
**CLI:** `--check-granularity --provider <provider>`
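The outlier rule in stage 2 can be sketched with the `statistics` module. Input is assumed to map each slug to its `(domain, scope_score)` from the LLM classification; using the population standard deviation is an assumption, as the task does not say which estimator is meant.

```python
from collections import defaultdict
from statistics import mean, pstdev


def scope_outliers(scope_by_entity: dict[str, tuple[str, int]], k: float = 1.5) -> list[str]:
    """Return slugs whose scope score is more than k sigma from their domain mean."""
    by_domain: dict[str, list[tuple[str, int]]] = defaultdict(list)
    for slug, (domain, scope) in scope_by_entity.items():
        by_domain[domain].append((slug, scope))

    outliers = []
    for items in by_domain.values():
        scores = [scope for _, scope in items]
        mu, sigma = mean(scores), pstdev(scores)
        if sigma == 0:
            continue  # all entities in this domain share the same scope score
        outliers.extend(slug for slug, scope in items if abs(scope - mu) > k * sigma)
    return sorted(outliers)
```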
## 19. Unified collection evaluation command — OPEN
**Depends on:** Tasks 13-18.
**Issue:** Running five separate `--check-*` commands is cumbersome and
repeats shared computation (metadata parsing, embedding, graph building).
**Suggested fix:** Add `--evaluate-collection --provider <provider>` that
runs all five checks in sequence, sharing infrastructure:
1. Parse entity metadata index (task 13) — used by all
2. Compute embeddings (task 14) — used by C1, C3
3. Build relationship graph (task 16) — used by C3, C4
4. Run all five concern checks
5. Write per-concern reports to `output/metrics/`
6. Write unified `metrics.yaml` with all collection metrics
7. Append to `metrics-history.yaml` (task 12)
Incremental mode: `--evaluate-collection --chapter <id>` re-evaluates
only entities from that chapter plus pairwise checks involving them.
Report a summary to stdout:
```
Collection evaluation (85 entities, 7 chapters):
  Redundancy:   3 synonym candidates, conciseness 0.96
  Coverage:     58% VSM, 20% chapters, 4 domain gaps
  Coherence:    1 component, density 0.12, 2 orphans
  Consistency:  0 cycles, 5 undefined deps, 0 contradictions
  Granularity:  entropy 1.42, 1 dispensable, 2 merge candidates
```


@@ -0,0 +1,501 @@
# Collection-Level Metrics Methodology
How we evaluate the quality of the infospace as a **collection of
interrelated concepts**, beyond the quality of individual entities.
This document describes the theoretical frameworks, drawn from ontology
engineering, formal concept analysis, semiotic quality theory, and DSL
design, and how each is adapted to work within MarkiTect's two-layer
evaluation model (LLM-Eval + deterministic aggregation).
---
## 1. The Two-Layer Model
Every metric in this methodology decomposes into two layers:
| Layer | What it does | How it runs |
|-------|-------------|-------------|
| **LLM-Eval** | Qualitative judgment: "Are these two concepts the same?", "Is this definition grounded in the source?" | Prompt template → LLM → structured YAML output |
| **Deterministic** | Quantitative aggregation: cosine similarity, graph connectivity, coverage counting, cycle detection | Python code in `process_chapters.py` or dedicated `metrics.py` |
The LLM-Eval layer produces **per-entity** or **per-pair** structured
scores. The deterministic layer **aggregates** these into collection-level
metrics, persisted as machine-readable YAML alongside human-readable
markdown reports.
Per-concept quality metrics (definition precision, source grounding, VSM
relevance — see INFRA-TASKS 8-12) operate at the individual entity level.
This document covers the five **collection-level concerns** that assess how
the entities work together as an explanatory system.
---
## 2. Five Collection-Level Concerns
### Overview
| # | Concern | Question | Primary framework |
|---|---------|----------|-------------------|
| C1 | Semantic Overlap | Are there redundant concepts? | OOPS! P2, embedding similarity |
| C2 | Coverage Completeness | Does the concept set cover the domain? | SEQUAL, FCA |
| C3 | Structural Coherence | Do concepts form a connected explanatory graph? | OntoQA, graph theory |
| C4 | Definitional Consistency | Are concepts defined consistently and non-circularly? | OntoClean, OOPS! P24 |
| C5 | Granularity Balance | Are concepts at comparable levels of abstraction? | Granularity theory, DSL laconicity |
---
## 3. Theoretical Frameworks
### 3.1 SEQUAL (Semiotic Quality Framework)
**Origin:** Lindland, Sindre & Sølvberg (1994), extended by Krogstie et al.
**What it defines:** Quality of a conceptual model as the correspondence
between three worlds — the domain (what exists), the model (what we
captured), and the audience's interpretation (what they understand).
Two key dimensions of **semantic quality**:
- **Validity** — everything in the model corresponds to something real
in the domain. No invented concepts.
- **Completeness** — everything relevant in the domain is represented in
the model. No missing concepts.
**How we use it:** SEQUAL frames our entire metrics approach. Every
collection-level metric maps to one or both of these dimensions:
| SEQUAL dimension | Our concerns |
|-----------------|--------------|
| Validity | C1 (redundancy reduces validity — duplicate concepts don't correspond to distinct domain facts), C4 (consistency — contradictory definitions can't both be valid) |
| Completeness | C2 (coverage — are all needed concepts present?), C5 (granularity — missing levels of abstraction are completeness gaps) |
| Both | C3 (coherence — disconnected concepts suggest either missing bridging concepts [completeness] or misplaced concepts [validity]) |
**Adaptation:** SEQUAL was designed for formal models evaluated by human
experts. We replace human judgment with LLM-Eval (for validity checks like
"does this concept correspond to something Smith actually described?") and
deterministic counting (for completeness checks like "which VSM systems
lack entity mappings?").
### 3.2 OntoClean
**Origin:** Guarino & Welty (2004).
**What it defines:** A methodology for validating taxonomic relationships
by assigning **metaproperties** to each concept:
- **Rigidity** — Is the property essential to all its instances? (e.g.
"market" is rigid; "effectual demander" is anti-rigid — an agent can
stop being an effectual demander)
- **Identity** — Does the concept carry an identity criterion? (e.g.
"division of labour" can be identified by its three causal mechanisms)
- **Unity** — Are all instances of this concept whole in the same way?
- **Dependence** — Does the concept require another concept to exist?
(e.g. "market price" depends on "effectual demand")
**Constraint:** A rigid concept cannot be subsumed by an anti-rigid one.
Violations indicate structural confusion.
**How we use it:** We do not have a formal taxonomy, but our flat entity
set implicitly contains subsumption relationships (e.g. "natural rate"
subsumes "ordinary-or-average rate"). OntoClean metaproperties help detect:
- **Granularity mismatches** (C5): A rigid concept at the same level as
an anti-rigid one suggests different abstraction levels are mixed.
- **Definitional consistency** (C4): If entity A depends on entity B per
OntoClean, but B's definition doesn't acknowledge A, the definitions
are inconsistent.
- **Redundancy** (C1): Two entities with identical metaproperty profiles
and overlapping definitions are candidates for merging.
**Adaptation:** Instead of manual metaproperty assignment, we use LLM-Eval
to classify each entity's rigidity, identity criterion, and dependencies.
The constraint checking is then deterministic.
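The deterministic constraint check could look like the sketch below. It assumes the LLM-Eval step has produced a rigidity label per entity (`"rigid"` or `"anti-rigid"`) and a list of `(parent, child)` subsumption pairs; both input shapes and the function name are illustrative.

```python
def rigidity_violations(
    subsumptions: list[tuple[str, str]], rigidity: dict[str, str]
) -> list[tuple[str, str]]:
    """Return (parent, child) pairs where an anti-rigid parent subsumes a rigid child."""
    return [
        (parent, child)
        for parent, child in subsumptions
        if rigidity.get(parent) == "anti-rigid" and rigidity.get(child) == "rigid"
    ]
```

Pairs with a missing rigidity label are skipped rather than reported, so unclassified entities never produce false violations.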
### 3.3 OOPS! (Ontology Pitfall Scanner)
**Origin:** Poveda-Villalón et al. (2014). Catalogue of 41 common
ontology design pitfalls.
**What it defines:** Concrete, testable anti-patterns. The pitfalls most
relevant to our infospace:
| Pitfall | Description | Our concern |
|---------|-------------|-------------|
| P2 | Synonymous classes — different names, same meaning | C1 (redundancy) |
| P4 | Unconnected ontology elements | C3 (coherence) |
| P7 | Merging different concepts in the same class | C5 (granularity: too coarse) |
| P10 | Missing disjointness | C1 (how do we know two concepts don't overlap?) |
| P11 | Missing domain or range in properties | C4 (consistency) |
| P13 | Inverse relationships not explicitly declared | C3 |
| P24 | Recursive/circular definition | C4 (consistency) |
| P25 | Inverse of itself | C4 |
**How we use it:** OOPS! pitfalls become a **checklist for LLM-Eval
prompts**. Rather than running a formal OWL scanner, we ask the LLM to
check for each pitfall pattern:
- "Are entities A and B synonymous?" (P2)
- "Does entity A's definition reference itself?" (P24)
- "Is entity A actually two distinct concepts merged together?" (P7)
The deterministic layer counts pitfall occurrences and tracks them over
time.
**Adaptation:** We select the subset of OOPS! pitfalls applicable to
semi-formal markdown-based ontologies (no OWL axioms) and implement each
as an LLM-Eval prompt pattern rather than a formal reasoner check.
### 3.4 OntoQA (Metric-Based Ontology Quality Analysis)
**Origin:** Tartir & Arpinar (2007).
**What it defines:** Quantitative schema-level and instance-level metrics:
- **Relationship Richness (RR):** Proportion of non-taxonomic (lateral)
relationships to total relationships. `RR = non_hierarchical / total`.
Low RR = mere taxonomy. High RR = rich cross-cutting connections.
- **Attribute Richness (AR):** Average number of attributes per concept.
`AR = total_attributes / total_concepts`.
- **Inheritance Richness (IR):** Average subclasses per class — measures
how knowledge distributes across the hierarchy.
- **Class Richness (CR):** Proportion of classes with instances.
**How we use it:** Our entities don't have formal relationships declared
between them, but we can **infer** a relationship graph from their
definitions and mappings:
- Entity A references entity B in its definition → definitional dependency
- Entities A and B map to the same VSM system → structural co-occurrence
- Entities A and B appear in the same chapter → contextual co-occurrence
From this inferred graph, we compute OntoQA metrics directly:
- **Relationship Richness** tells us whether our concepts form a web of
explanatory connections or just a flat list.
- **Attribute Richness** maps to our schema sections — entities with more
optional sections filled (Original Wording, Modern Interpretation) are
richer.
**Adaptation:** The key modification is that relationship inference is an
LLM-Eval step (pairwise: "does A's definition depend on or reference B?"),
after which all OntoQA metrics are computed deterministically on the
resulting graph.
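Once edge counts and section presence are known, the two metrics we rely on reduce to simple ratios. A sketch, assuming attributes are counted as the schema sections present per entity; the function names and input shapes are illustrative.

```python
def relationship_richness(non_hierarchical: int, hierarchical: int) -> float:
    """RR = non_hierarchical / total relationships (0.0 for an empty graph)."""
    total = non_hierarchical + hierarchical
    return non_hierarchical / total if total else 0.0


def attribute_richness(sections_by_entity: dict[str, set[str]]) -> float:
    """AR = total attributes / total concepts, counting schema sections as attributes."""
    if not sections_by_entity:
        return 0.0
    total_sections = sum(len(sections) for sections in sections_by_entity.values())
    return total_sections / len(sections_by_entity)
```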
### 3.5 Formal Concept Analysis (FCA)
**Origin:** Wille (1982). Applied to ontology auditing by Elhaj et al.
(2008) for SNOMED CT completeness checking.
**What it defines:** A mathematical framework for deriving a **concept
lattice** from a binary relation between objects and attributes. The
lattice reveals:
- **Formal concepts**: maximal sets of objects sharing the same attributes
- **Subconcept/superconcept** relationships: the natural hierarchy
- **Missing concepts**: attribute combinations with no corresponding object
**How we use it:** We construct a **formal context** (binary matrix):
- **Objects** = our 85 entities
- **Attributes** = economic domain, VSM system, source book, abstraction
level (from LLM-Eval), key terms (extracted from definitions)
The concept lattice then reveals:
- **Coverage gaps** (C2): Attribute combinations with no entity. E.g. if
the cell {Distribution, S3} is empty, we lack control-layer concepts
for distribution — a specific, actionable gap.
- **Redundancy** (C1): Entities with identical attribute sets (same formal
concept) are candidates for merging.
- **Granularity** (C5): The lattice depth indicates how many meaningful
levels of abstraction exist. A shallow lattice suggests missing
intermediate concepts.
**Adaptation:** Classic FCA requires crisp binary attributes. Our domains
and VSM mappings are already categorical, but abstraction level and key
terms need LLM-Eval to produce. The lattice computation itself is
deterministic (Python `concepts` library or equivalent). The FCA approach
replaces the current "ask the LLM about coverage" with a structural
computation that can identify *specific* gaps rather than vague
recommendations.
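To make the lattice idea concrete, here is a small stdlib-only sketch, assuming the formal context is held as a dict from entity slug to a set of attributes. It enumerates formal concepts by closing the object attribute sets under intersection, then derives empty {domain, VSM} cells and identical-attribute groups. The function names and the VSM assignments in the example are illustrative assumptions, and a real run would more likely use the `concepts` library mentioned above.

```python
from itertools import product


def formal_concepts(context):
    """Return every (extent, intent) pair for a context {object: set_of_attributes}."""
    all_attrs = frozenset().union(*context.values())
    intents = {all_attrs}
    # Every concept intent is an intersection of object attribute sets.
    for attrs in context.values():
        intents |= {intent & attrs for intent in intents}
    return [
        (frozenset(o for o, attrs in context.items() if intent <= attrs), intent)
        for intent in intents
    ]


def empty_cells(context, domains, systems):
    """{domain, VSM system} combinations that no entity carries (coverage gaps)."""
    return [
        (d, s)
        for d, s in product(domains, systems)
        if not any({d, s} <= attrs for attrs in context.values())
    ]


def duplicate_groups(context):
    """Entities with identical attribute sets (merge candidates)."""
    groups = {}
    for obj, attrs in context.items():
        groups.setdefault(frozenset(attrs), []).append(obj)
    return [sorted(g) for g in groups.values() if len(g) > 1]


# Hypothetical context; the attribute assignments are made up for the example.
context = {
    "division-of-labour": {"Production", "S1"},
    "market-extent": {"Exchange", "S4"},
    "natural-price": {"Exchange", "S3"},
    "market-price": {"Exchange", "S3"},
}
concepts = formal_concepts(context)
gaps = empty_cells(context, ["Production", "Exchange", "Distribution"], ["S1", "S3", "S4"])
duplicates = duplicate_groups(context)
```

The closure loop is exponential in the worst case, which is acceptable for a few dozen entities with a handful of categorical attributes but would need a proper algorithm (or the library) once key terms are added as attributes.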
### 3.6 DSL Design Principles
**Origin:** Mernik et al. (2005) "When and How to Develop Domain-Specific
Languages"; Karsai et al. (2014) "Design Guidelines for Domain Specific
Languages".
**What they define:** Quality criteria for a set of concepts that form a
language for a specific domain:
- **Soundness**: Every concept in the language corresponds to a real domain
concern (no invented abstractions).
- **Completeness**: The language can express everything needed for its
intended tasks.
- **Laconicity**: No unnecessary concepts — every concept earns its place.
- **Orthogonality**: Concepts are independent; combining any two produces
a meaningful result (no redundant combinations).
**How we use it:** Our entity set is effectively a domain-specific
vocabulary for "explaining classical economics through VSM". DSL quality
criteria translate directly:
- **Soundness** → Validity (SEQUAL): every entity grounded in Smith's text
- **Completeness** → Coverage (C2): can we answer the "competency
questions" the infospace is meant to address?
- **Laconicity** → Anti-redundancy (C1) + Indispensability (C5): would
removing any entity lose explanatory power?
- **Orthogonality** → Non-overlap (C1): entity definitions don't
substantially duplicate each other
**Adaptation:** We operationalise DSL completeness through **competency
questions** — a set of canonical questions the infospace should be able to
answer (e.g. "How does the division of labour relate to market extent?",
"What mechanisms regulate wages toward their natural rate?"). LLM-Eval
tests whether the current entity set suffices to answer each question.
Unanswerable questions identify specific completeness gaps.
Laconicity is operationalised as **indispensability scoring**: for each
entity, LLM-Eval rates whether removing it would lose explanatory power.
Low-scoring entities are candidates for merging or retirement.
---
## 4. Integration: Metric Definitions by Concern
### C1: Semantic Overlap / Redundancy
**Goal:** Identify entities that substantially overlap in meaning and
should be merged, distinguished, or retired.
**Metrics:**
| Metric | Type | Computation |
|--------|------|-------------|
| `similarity_matrix` | Deterministic | Embed all entity definitions; compute NxN cosine similarity |
| `high_similarity_pairs` | Deterministic | Pairs with cosine > 0.80, sorted descending |
| `confirmed_synonyms` | LLM-Eval | For each high-similarity pair, LLM judges: "same concept" / "genuinely distinct" / "partial overlap" |
| `redundancy_ratio` | Deterministic | `len(confirmed_synonyms) / total_entities`, counting only pairs judged "same concept" |
| `intensional_conciseness` | Deterministic | `1 - redundancy_ratio` (from KG quality framework) |
**Pipeline:**
1. Embed definitions (embedding API or local model)
2. Compute cosine similarity matrix
3. Filter pairs above threshold
4. LLM pairwise judgment on filtered pairs only (avoids N² LLM calls)
5. Aggregate into ratio and conciseness score
**Output:** `output/metrics/redundancy-report.md` + structured YAML with
pair list, scores, and merge/retire recommendations.
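Steps 2, 3, and 5 of the pipeline are purely deterministic and could look roughly like the sketch below. It assumes embeddings are already available as plain lists of floats keyed by slug; the toy vectors and function names are illustrative, and a real implementation would likely use NumPy for the NxN matrix rather than pairwise Python loops.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def high_similarity_pairs(embeddings, threshold=0.80):
    """Pairs above the threshold, most similar first; only these go to the LLM."""
    pairs = [
        (a, b, cosine(embeddings[a], embeddings[b]))
        for a, b in combinations(sorted(embeddings), 2)
    ]
    return sorted((p for p in pairs if p[2] > threshold), key=lambda p: -p[2])


def redundancy_scores(confirmed_pairs, total_entities):
    """Aggregate LLM-confirmed synonym pairs into the two collection scores."""
    ratio = len(confirmed_pairs) / total_entities
    return {"redundancy_ratio": ratio, "intensional_conciseness": 1 - ratio}


# Toy 3-dimensional vectors standing in for real definition embeddings.
embeddings = {
    "market-price": [0.9, 0.1, 0.0],
    "natural-price": [0.8, 0.2, 0.1],
    "division-of-labour": [0.0, 0.1, 0.9],
}
candidates = high_similarity_pairs(embeddings)
scores = redundancy_scores([("a", "b")], total_entities=4)
```

Here only the `market-price` / `natural-price` pair clears the 0.80 threshold, so a single LLM judgment is needed instead of three.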
### C2: Coverage Completeness
**Goal:** Identify domain areas and VSM systems that lack adequate
representation in the entity set.
**Metrics:**
| Metric | Type | Computation |
|--------|------|-------------|
| `domain_vsm_matrix` | Deterministic | Count entities per {economic_domain, VSM_system} cell |
| `coverage_ratio` | Deterministic | `populated_cells / expected_cells` |
| `vsm_balance_entropy` | Deterministic | Shannon entropy of entity distribution across VSM systems (higher = more balanced) |
| `empty_cells` | Deterministic | List of {domain, VSM_system} pairs with zero entities |
| `competency_coverage` | LLM-Eval | For each competency question, can it be answered with current entities? |
| `fca_gap_concepts` | Deterministic | Attribute combinations in the FCA lattice with no corresponding entity |
**Pipeline:**
1. Parse entity metadata (domain, VSM mapping) from files on disk
2. Build domain × VSM matrix; identify empty cells
3. Build FCA formal context; compute lattice; extract gap concepts
4. Define competency questions (initially hand-written, later LLM-generated
from the source material)
5. LLM-evaluate answerability of each question
6. Aggregate into coverage ratio, entropy, and gap list
**Output:** `output/metrics/coverage-report.md` + YAML with matrix, gaps,
and competency question results.
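The deterministic coverage metrics can be sketched as below. Two assumptions are baked in and should be read as such: `expected_cells` is taken to be the full domain × VSM Cartesian product (the text above does not define it), and entropy is computed in bits over the VSM systems that actually occur. The input data is invented for the example.

```python
import math
from collections import Counter


def coverage_metrics(entities, domains, systems):
    """entities: list of (domain, vsm_system) tuples, one per entity."""
    matrix = Counter(entities)
    # Assumption: every {domain, system} combination is an expected cell.
    expected = [(d, s) for d in domains for s in systems]
    empty = [cell for cell in expected if matrix[cell] == 0]

    per_system = Counter(system for _, system in entities)
    total = sum(per_system.values())
    entropy = -sum((n / total) * math.log2(n / total) for n in per_system.values())

    return {
        "coverage_ratio": (len(expected) - len(empty)) / len(expected),
        "vsm_balance_entropy": entropy,
        "empty_cells": empty,
    }


# Illustrative placements, not the real metadata index.
entities = [
    ("Production", "S1"),
    ("Production", "S1"),
    ("Exchange", "S3"),
    ("Exchange", "S4"),
]
result = coverage_metrics(entities, ["Production", "Exchange"], ["S1", "S3", "S4"])
```

If some cells are not meaningful for the topic, the expected set should be declared explicitly rather than derived from the product, otherwise `coverage_ratio` will be pessimistic.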
### C3: Structural Coherence
**Goal:** Determine whether the entities form a connected explanatory web
or a fragmented collection of isolated concepts.
**Metrics:**
| Metric | Type | Computation |
|--------|------|-------------|
| `relationship_graph` | LLM-Eval + Deterministic | Infer edges from definition cross-references (string matching) + LLM judgment for implicit references |
| `connected_components` | Deterministic | Number of connected components in the graph (target: 1) |
| `graph_density` | Deterministic | `actual_edges / possible_edges` |
| `avg_degree` | Deterministic | `total_edges / total_entities` (mean out-degree of the directed graph) |
| `relationship_richness` | Deterministic | OntoQA RR: `non_hierarchical_edges / total_edges` |
| `modularity` | Deterministic | Louvain modularity score (0.3-0.7 = meaningful structure; >0.8 = fragmentation) |
| `bridge_concepts` | Deterministic | Entities with highest betweenness centrality (connect clusters) |
| `orphan_entities` | Deterministic | Entities with degree 0 or 1 |
| `cohesion_by_domain` | Deterministic | Avg intra-domain edges per entity |
| `coupling_across_domains` | Deterministic | Inter-domain edges / total edges |
**Pipeline:**
1. Extract explicit cross-references from definitions (entity name
mentions in other definitions — string matching with slug normalisation)
2. For entity pairs not caught by string matching, LLM-Eval: "Does A's
definition depend on or reference B's concept?"
3. Build directed graph
4. Compute graph metrics (networkx or equivalent)
5. Run community detection; compare detected communities to declared
economic domains
**Output:** `output/metrics/coherence-report.md` + YAML with graph
statistics, orphan list, bridge concepts, and community structure.
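A few of the graph metrics above can be computed without any dependency, as in this sketch. It assumes edges are directed `(source, target)` slug pairs, treats connectivity as weak (direction ignored), and counts degree as the number of distinct neighbours; all three are assumptions for illustration. Modularity and betweenness are omitted here and would more reasonably come from networkx.

```python
from collections import defaultdict


def coherence_metrics(nodes, edges):
    """nodes: iterable of slugs; edges: set of directed (source, target) pairs."""
    nodes = list(nodes)
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)  # weak connectivity ignores edge direction

    # Count weakly connected components with an iterative depth-first search.
    seen, components = set(), 0
    for start in nodes:
        if start in seen:
            continue
        components += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(neighbours[node] - seen)

    n = len(nodes)
    possible = n * (n - 1)  # directed graph without self-loops
    return {
        "connected_components": components,
        "graph_density": len(edges) / possible if possible else 0.0,
        "avg_degree": len(edges) / n if n else 0.0,
        "orphan_entities": sorted(x for x in nodes if len(neighbours[x]) <= 1),
    }


# Illustrative graph; the edges are invented for the example.
nodes = [
    "division-of-labour", "market-extent", "market-price",
    "natural-price", "effectual-demand", "rent",
]
edges = {
    ("division-of-labour", "market-extent"),
    ("market-price", "natural-price"),
    ("market-price", "effectual-demand"),
    ("natural-price", "effectual-demand"),
}
stats = coherence_metrics(nodes, edges)
```

With three components against a target of one, this toy graph would be reported as fragmented, and `rent` would show up as a fully isolated orphan.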
### C4: Definitional Consistency
**Goal:** Ensure entities are defined consistently, non-circularly, and
without contradicting each other.
**Metrics:**
| Metric | Type | Computation |
|--------|------|-------------|
| `definitional_dependency_graph` | Deterministic + LLM-Eval | Edges where A's definition uses B's concept |
| `circular_definitions` | Deterministic | Cycles of length ≤ 3 in the dependency graph |
| `definition_depth` | Deterministic | Longest dependency chain per entity before reaching a term not in the entity set |
| `undefined_dependencies` | Deterministic | Terms used in definitions that arguably should be entities but aren't |
| `pairwise_consistency` | LLM-Eval | For related entity pairs (sharing edges): "Do these definitions contradict each other?" |
| `source_fidelity` | LLM-Eval | "Does this definition accurately represent what Smith wrote in the cited passage?" |
| `metaproperty_violations` | LLM-Eval + Deterministic | OntoClean constraint checking after LLM classifies rigidity/identity |
| `grounding_ratio` | Deterministic | Fraction of entities traceable to primitives without cycles |
**Pipeline:**
1. Build definitional dependency graph (same technique as C3; edge
direction is significant here: A → B means A's definition uses B, not
vice versa)
2. Detect cycles; flag short cycles
3. Extract undefined terms (terms matching entity-name patterns that appear
in definitions but have no corresponding entity file)
4. LLM pairwise consistency check on directly-connected pairs
5. LLM source fidelity check (compare definition to source chapter text)
6. LLM OntoClean metaproperty classification; deterministic constraint
checking
**Output:** `output/metrics/consistency-report.md` + YAML with cycle list,
undefined terms, contradiction candidates, and metaproperty violations.
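The cycle and grounding checks are small enough to sketch directly. The version below assumes the dependency graph is a dict from entity slug to the set of terms its definition uses, and it interprets "grounded" as "no cycle is reachable through in-set dependencies"; terms outside the entity set are treated as primitives. Both the interpretation and the example dependencies are assumptions.

```python
def short_cycles(deps, max_len=3):
    """Cycles of length <= max_len, each reported once, starting at its smallest slug."""
    cycles = set()

    def walk(start, node, path):
        for nxt in deps.get(node, ()):
            if nxt == start:
                cycles.add(tuple(path))
            elif nxt > start and nxt not in path and len(path) < max_len:
                walk(start, nxt, path + [nxt])

    for start in deps:
        walk(start, start, [start])
    return sorted(cycles)


def grounding_ratio(deps):
    """Fraction of entities whose dependency chains bottom out without a cycle."""
    grounded = set()
    changed = True
    while changed:
        changed = False
        for entity, used in deps.items():
            internal = {u for u in used if u in deps}  # ignore primitives
            if entity not in grounded and internal <= grounded:
                grounded.add(entity)
                changed = True
    return len(grounded) / len(deps) if deps else 1.0


# Invented dependencies; "money" is deliberately absent from the entity set.
deps = {
    "value": {"labour"},
    "labour": set(),
    "price": {"value", "money"},
    "natural-price": {"market-price"},
    "market-price": {"natural-price"},
}
cycles = short_cycles(deps)
ratio = grounding_ratio(deps)
```

In this toy graph `money` would also be a natural candidate for `undefined_dependencies`, since it is used in a definition but has no entity of its own.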
### C5: Granularity Balance
**Goal:** Ensure entities operate at comparable levels of abstraction
within their respective domains and perspectives.
**Metrics:**
| Metric | Type | Computation |
|--------|------|-------------|
| `abstraction_classification` | LLM-Eval | Classify each entity as theory-level / mechanism-level / observation-level |
| `scope_score` | LLM-Eval | Rate each entity 1-5 for generality (1 = very specific instance, 5 = broad theoretical principle) |
| `abstraction_distribution` | Deterministic | Count per level; compute entropy |
| `scope_variance` | Deterministic | Variance of scope scores within each domain |
| `level_x_perspective_matrix` | Deterministic | Cross-tabulation of abstraction level × economic domain |
| `indispensability` | LLM-Eval | "If removed, what explanatory power is lost?" (1-5) |
| `dispensable_entities` | Deterministic | Entities with indispensability score ≤ 2 |
| `merge_candidates` | LLM-Eval | Pairs where one is a sub-case of the other |
**Pipeline:**
1. LLM-classify each entity: abstraction level, scope score,
indispensability
2. Build level × perspective matrix
3. Compute distribution entropy and per-domain scope variance
4. Flag outliers: entities whose scope score deviates > 1.5σ from their
domain mean
5. For outlier entities, LLM-Eval: "Should this be merged into a broader
concept, or split into sub-concepts?"
**Output:** `output/metrics/granularity-report.md` + YAML with
classifications, distribution, outliers, and merge/split recommendations.
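Step 4 of the pipeline (outlier flagging) might be sketched like this. It assumes scope scores arrive as a `{slug: score}` dict with a parallel `{slug: domain}` dict, and it uses the population standard deviation; whether population or sample deviation is intended is not stated above, so that choice is an assumption. The slugs and scores are invented.

```python
from collections import defaultdict
from statistics import mean, pstdev


def scope_outliers(scores, domains, k=1.5):
    """Slugs whose scope score deviates more than k sigma from their domain mean."""
    by_domain = defaultdict(list)
    for slug, score in scores.items():
        by_domain[domains[slug]].append(score)

    outliers = []
    for slug, score in scores.items():
        values = by_domain[domains[slug]]
        sigma = pstdev(values)  # assumption: population, not sample, deviation
        if sigma and abs(score - mean(values)) > k * sigma:
            outliers.append(slug)
    return sorted(outliers)


# Illustrative scores on the 1-5 generality scale.
scores = {
    "market-price": 3, "natural-price": 3, "effectual-demand": 3,
    "barter": 3, "value-theory": 5,
    "division-of-labour": 4, "pin-factory": 2,
}
domains = {
    "market-price": "Exchange", "natural-price": "Exchange",
    "effectual-demand": "Exchange", "barter": "Exchange",
    "value-theory": "Exchange",
    "division-of-labour": "Production", "pin-factory": "Production",
}
flagged = scope_outliers(scores, domains)
```

Note that with the population deviation a domain needs at least four entities before any single score can exceed 1.5σ, so small domains will never produce outliers under this rule.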
---
## 5. Shared Infrastructure
Several concerns share underlying computations:
| Infrastructure | Used by | Build once |
|---------------|---------|------------|
| Definition embeddings (vector per entity) | C1, C3 | Embedding API call per entity |
| Relationship graph (entity → entity edges) | C3, C4 | String matching + LLM-Eval |
| FCA formal context (entity × attribute matrix) | C2, C5 | Metadata parsing + LLM classification |
| Entity metadata index (domain, VSM, chapter, sections) | C2, C5, INFRA-TASKS #10 (schema compliance) | Deterministic markdown parsing |
These should be computed once per evaluation run and cached for use by
all concern-specific metrics.
---
## 6. Evaluation Workflow
A full collection-level evaluation run:
```
process_chapters.py --evaluate-collection --provider <provider>
```
1. **Parse** — deterministic metadata extraction from all entity files
2. **Embed** — compute definition embeddings (cached; only new/changed
entities need fresh embeddings)
3. **Infer** — LLM-Eval for relationship edges, metaproperties,
abstraction levels, pairwise judgments (batched to minimise LLM calls)
4. **Compute** — deterministic graph metrics, FCA lattice, coverage
matrix, similarity matrix, cycle detection
5. **Aggregate** — combine per-entity and per-pair scores into
collection-level metrics
6. **Report** — write per-concern markdown reports + unified `metrics.yaml`
7. **Append** — add timestamped snapshot to `metrics-history.yaml`
Incremental mode (`--evaluate-collection --chapter <id>`) re-evaluates
only the entities introduced or modified by that chapter, plus any
pairwise checks involving those entities.
---
## 7. References
- Lindland, O.I., Sindre, G. & Sølvberg, A. (1994). "Understanding
Quality in Conceptual Modeling." *IEEE Software* 11(2), 42-49.
→ SEQUAL framework: validity and completeness dimensions.
- Guarino, N. & Welty, C.A. (2004). "An Overview of OntoClean." In
*Handbook on Ontologies*, Springer, 151-171.
→ Metaproperty analysis: rigidity, identity, unity, dependence.
- Poveda-Villalón, M., Gómez-Pérez, A. & Suárez-Figueroa, M.C. (2014).
"OOPS! (OntOlogy Pitfall Scanner!): An On-line Tool for Ontology
Evaluation." *IJSWIS* 10(2), 7-34.
→ Pitfall catalogue: 41 anti-patterns for ontology design.
- Tartir, S. & Arpinar, I.B. (2007). "Ontology Evaluation and Ranking
using OntoQA." *ICSC 2007*, IEEE, 185-192.
→ Schema metrics: relationship richness, attribute richness.
- Wille, R. (1982). "Restructuring Lattice Theory." In *Ordered Sets*,
Reidel, 445-470.
→ Formal Concept Analysis: concept lattices from binary contexts.
- Elhaj, H. et al. (2008). "Auditing SNOMED CT with Formal Concept
Analysis." *AMIA Annual Symposium*, PMC2605587.
→ FCA for ontology completeness auditing.
- Keet, C.M. (2008). *A Formal Theory of Granularity.* PhD thesis,
Free University of Bozen-Bolzano.
→ Granularity levels and perspectives for ontology design.
- Mernik, M., Heering, J. & Sloane, A.M. (2005). "When and How to
Develop Domain-Specific Languages." *ACM Computing Surveys* 37(4),
316-344.
→ DSL design: soundness, completeness, laconicity.
- Karsai, G. et al. (2014). "Design Guidelines for Domain Specific
Languages." *arXiv:1409.2378*.
→ Orthogonality, necessary-and-sufficient principle.
- Xue, B. & Zou, L. (2022). "Knowledge Graph Quality Management: A
Comprehensive Survey." *IEEE TKDE* 35(5), 4969-4988.
→ KG quality dimensions: conciseness, consistency, completeness.


@@ -0,0 +1,621 @@
# Viable Infospace Tooling — Roadmap
## Vision
An **infospace** is a structured, evaluable, composable collection of
concepts that explains a **topic** through the lens of one or more
**disciplines**. Infospaces are the unit of knowledge work in MarkiTect.
This roadmap organises the work needed to move from the current
ad-hoc example (`infospace-with-history`) to a general-purpose platform
for creating, evaluating, maintaining, and composing infospaces.
---
## Terminology
These terms establish the vocabulary for infospace tooling. They
generalise from the Wealth of Nations / VSM example but are not
specific to it.
### Infospace
A curated, self-describing collection of **entities** (concepts,
mechanisms, observations) that together explain a **topic**. An
infospace has:
- A **topic** — the subject matter being explained (e.g. "The Wealth
of Nations", "cellular biology", "Kubernetes networking")
- One or more **disciplines** — external frameworks applied as lenses
(e.g. "Viable System Model", "category theory")
- **Entities** — the atomic units of knowledge, each with a definition,
provenance, and quality scores
- **Schemas** — structural templates that define what a well-formed
entity, mapping, or analysis looks like
- **Evaluations** — per-entity and collection-level quality assessments
- **Metrics** — quantitative indicators of completeness, coherence,
consistency, and granularity balance
An infospace is **viable** when it meets threshold scores across its
defined metrics — it is fit for purpose as an explanatory tool.
### Topic
The subject matter an infospace is built to explain. A topic sits
within a **domain** (broader field of knowledge) but is more specific:
- Domain: Economics → Topic: The Wealth of Nations
- Domain: Systems Theory → Topic: Viable System Model
- Domain: Computer Science → Topic: Distributed consensus protocols
A topic provides the **source material** — the texts, data, or
observations from which entities are extracted.
### Discipline
A reusable framework of concepts applied as a lens to explore a topic.
A discipline is itself an infospace — one that has been evaluated as
viable and packaged for reuse.
In our example, the VSM is the discipline: a set of concepts (S1-S5,
recursion, variety, viability) from systems theory, applied to the
economic concepts in Smith's work.
**Key property:** Disciplines compose. An infospace built with one
discipline can itself become a discipline for another infospace. The
Wealth of Nations infospace, viewed through VSM, could become a
discipline applied to a modern supply chain analysis.
### Entity
The atomic unit of an infospace. An entity has:
- **Identity**: a unique slug and human-readable title
- **Definition**: a precise, non-circular explanation
- **Provenance**: the source chapter, passage, and extraction context
- **Domain placement**: which area of the topic it belongs to
- **Discipline mapping**: how it connects to the applied discipline
(e.g. which VSM system)
- **Quality scores**: per-entity LLM-evaluated metrics
- **Lifecycle state**: active, archived (with reason), or draft
### Evaluation
A structured assessment of quality, applied at two levels:
- **Per-entity evaluation**: scores an individual entity against
quality rubrics defined in its schema (definition precision, source
grounding, discipline relevance, etc.)
- **Collection evaluation**: scores the entity set as a whole against
five concerns: redundancy, coverage, coherence, consistency, and
granularity balance
Evaluations are always performed by **delegated LLM calls** through
MarkiTect's LLM integration — never by the coding agent working on
infrastructure. This separation ensures that domain-level judgment
stays in the problem space, not the tooling space.
### Viability
An infospace is viable when:
1. Its entities individually meet quality thresholds (per-entity eval)
2. Its collection metrics are within acceptable ranges
3. It can answer its defined **competency questions** — the canonical
queries the infospace is meant to support
4. It has been evaluated recently enough that metrics reflect current
content
Viability is not binary — it is a profile of scores that the user
sets thresholds for based on their needs.
---
## Architecture: Three Layers
```
┌──────────────────────────────────────────────────┐
│ Layer 3: Infospace Instances │
│ Specific infospaces built by users │
│ (Wealth of Nations + VSM, supply chain + ...) │
│ Works IN an infospace │
├──────────────────────────────────────────────────┤
│ Layer 2: Infospace Tooling │
│ Terminology, primitives, composition model │
│ CLI: infospace create/evaluate/compose/... │
│ Works WITH infospaces │
├──────────────────────────────────────────────────┤
│ Layer 1: MarkiTect Platform │
│ Artifacts, prompts, LLM, spaces, graph, embed │
│ Provides FOR infospaces │
└──────────────────────────────────────────────────┘
```
### Boundary condition: LLM delegation
All LLM-based evaluation (entity scoring, pairwise judgments, coverage
analysis) is delegated to MarkiTect's LLM integration module. The coding
agent that works on infrastructure never makes domain-level judgments
itself. This keeps a clean separation:
- **Coding agent** → writes Python, templates, schemas, tests
- **MarkiTect LLM** → evaluates entities, judges redundancy, assesses
coverage, checks consistency
The infospace tooling (Layer 2) orchestrates these LLM calls through
prompt templates and the prompt execution engine, not through ad-hoc
prompting.
---
## Stage 1: MarkiTect Platform Additions
Infrastructure that must exist before infospace tooling can be built.
These are general-purpose platform capabilities, not infospace-specific.
### S1.1 — Entity metadata parser
Add a deterministic markdown parser that extracts structured metadata
from entity files: H1 title, sections present, word counts, domain,
source chapter. Returns a dataclass usable by all downstream metrics.
**Maps to:** INFRA-TASKS #13, #10
**Location:** `markitect/prompts/quality/` or new `markitect/analysis/`
**Depends on:** Nothing — can start immediately
**Deliverable:** `parse_entity_metadata(path) -> EntityMeta` function
with tests
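A first cut of the parser could be as small as the sketch below. The `EntityMeta` fields shown are a guess at a minimal shape, and the assumption that sections are H2 headings is not confirmed by the schemas; domain and source chapter are left out because where they live in an entity file is not specified here.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class EntityMeta:
    slug: str
    title: str
    sections: dict = field(default_factory=dict)  # section heading -> word count


def parse_entity_text(slug, text):
    """Extract the H1 title and per-section word counts from entity markdown."""
    title, sections, current = "", {}, None
    for line in text.splitlines():
        if line.startswith("# ") and not title:
            title = line[2:].strip()
        elif line.startswith("## "):
            current = line[3:].strip()
            sections[current] = 0
        elif current is not None:
            sections[current] += len(line.split())
    return EntityMeta(slug=slug, title=title, sections=sections)


def parse_entity_metadata(path):
    """File-based entry point; the slug is assumed to be the file stem."""
    path = Path(path)
    return parse_entity_text(path.stem, path.read_text(encoding="utf-8"))


sample = (
    "# Division of Labour\n\n"
    "## Definition\n"
    "The separation of work into distinct tasks.\n\n"
    "## Original Wording\n\n"
)
meta = parse_entity_text("division-of-labour", sample)
```

Splitting the text-level function from the path-level one keeps the parser testable without fixtures on disk. The sketch does not skip fenced code blocks, so a `## ` line inside a fence would be miscounted as a section.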
### S1.2 — Schema compliance validator
Deterministic validation of entity/mapping files against their schemas:
section presence, word count ranges, heading format, enum values. No
LLM needed.
**Maps to:** INFRA-TASKS #10
**Location:** `markitect/prompts/quality/validator.py` (extend existing)
**Depends on:** S1.1
**Deliverable:** `validate_document(path, schema) -> ValidationResult`
with tests
### S1.3 — Embedding adapter
Add embedding support to `markitect/llm/`. Needs:
- `EmbeddingAdapter` interface: `embed(texts: list[str]) -> list[list[float]]`
- `OpenRouterEmbeddingAdapter` implementation (or OpenAI embedding endpoint)
- Caching layer: store embeddings keyed by `{slug: content_digest}` so
unchanged entities skip re-embedding
- Cosine similarity utility: `similarity_matrix(embeddings) -> np.ndarray`
**Maps to:** INFRA-TASKS #14 (prerequisite)
**Location:** `markitect/llm/embeddings.py`
**Depends on:** Nothing — can start immediately
**Deliverable:** Embedding adapter + cache + similarity computation, with
tests
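The caching requirement can be illustrated with an in-memory sketch. `CachedEmbedder` and `content_digest` are hypothetical names, SHA-256 is an assumed digest choice, and `fake_embed` stands in for whatever `EmbeddingAdapter.embed` ends up being; persistence to disk is deliberately left out.

```python
import hashlib


def content_digest(text):
    """Stable digest of a definition, used to detect unchanged entities."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class CachedEmbedder:
    def __init__(self, embed_fn):
        self._embed_fn = embed_fn  # stands in for EmbeddingAdapter.embed
        self._cache = {}           # slug -> (digest, vector)

    def embed_entities(self, definitions):
        """definitions: {slug: definition_text}; returns {slug: vector}."""
        stale = [
            slug
            for slug, text in definitions.items()
            if self._cache.get(slug, (None, None))[0] != content_digest(text)
        ]
        if stale:
            vectors = self._embed_fn([definitions[s] for s in stale])
            for slug, vector in zip(stale, vectors):
                self._cache[slug] = (content_digest(definitions[slug]), vector)
        return {slug: self._cache[slug][1] for slug in definitions}


calls = []


def fake_embed(texts):
    """Toy embedding that records each batch it is asked to embed."""
    calls.append(list(texts))
    return [[float(len(t)), 1.0] for t in texts]


embedder = CachedEmbedder(fake_embed)
first = embedder.embed_entities({"a": "first", "b": "second"})
second = embedder.embed_entities({"a": "first", "b": "second, revised"})
```

On the second call only the revised definition is sent to the embedding function, which is the behaviour the caching bullet asks for.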
### S1.4 — Graph analysis utilities
The existing `DependencyGraph` supports basic traversal and cycle
detection. Collection-level metrics need richer analysis:
- Connected components
- Betweenness centrality
- Community detection (Louvain or label propagation)
- Modularity score
- Degree distribution
- Cohesion/coupling computation
Decide: extend `DependencyGraph` or add a lightweight wrapper that
converts to networkx (adding it as an optional dependency).
**Maps to:** INFRA-TASKS #16 (prerequisite)
**Location:** `markitect/prompts/dependencies/analysis.py` or new
`markitect/analysis/graph.py`
**Depends on:** Nothing — can start immediately
**Deliverable:** Graph analysis functions with tests
### S1.5 — Structured evaluation output
Define a standard format for evaluation results: YAML front-matter +
markdown body. Add utilities for:
- Writing evaluation results (per-entity, per-pair, collection-level)
- Reading/parsing evaluation results back into dataclasses
- Appending timestamped snapshots to a history file
- Diffing two snapshots
**Maps to:** INFRA-TASKS #11, #12
**Location:** `markitect/prompts/quality/` or `markitect/analysis/`
**Depends on:** S1.1
**Deliverable:** `EvaluationResult` model + read/write utilities with
tests
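Snapshot diffing, the last utility in the list, could start from something like the following. It assumes a snapshot is a flat `{metric_name: value}` dict, which may not match the eventual `EvaluationResult` model; the function name and the example values are illustrative.

```python
def diff_snapshots(old, new):
    """Report metrics that were added, removed, or changed between two snapshots."""
    changes = {}
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name), new.get(name)
        if before == after:
            continue
        numeric = all(isinstance(v, (int, float)) for v in (before, after))
        changes[name] = {
            "before": before,
            "after": after,
            "delta": after - before if numeric else None,
        }
    return changes


# Invented snapshot values.
old = {"coverage_ratio": 0.5, "connected_components": 3}
new = {"coverage_ratio": 0.75, "connected_components": 3, "redundancy_ratio": 0.02}
changes = diff_snapshots(old, new)
```

Metrics that appear in only one snapshot are reported with `None` on the missing side, so newly introduced metrics are visible in the diff rather than silently ignored.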
### S1.6 — Batch LLM evaluation orchestrator
A pipeline component that runs an evaluation prompt template against a
batch of entities (or entity pairs), collecting structured results.
Must handle:
- Rate limiting and retry (reuse existing adapter logic)
- Progress reporting
- Incremental evaluation (skip entities whose content hasn't changed
since last eval)
- Result aggregation
This is the mechanism by which infospace tooling delegates LLM work
to the platform.
**Maps to:** INFRA-TASKS #9 (prerequisite)
**Location:** `markitect/prompts/execution/batch.py`
**Depends on:** S1.5
**Deliverable:** `BatchEvaluator` class with tests
### S1.7 — FCA computation
Formal Concept Analysis: build a formal context (entity × attribute
matrix), compute the concept lattice, extract gap concepts. Either
implement a minimal FCA algorithm or integrate a library.
**Maps to:** INFRA-TASKS #15 (prerequisite)
**Location:** `markitect/analysis/fca.py`
**Depends on:** S1.1
**Deliverable:** `FormalContext`, `ConceptLattice`, `find_gap_concepts()`
with tests
### Summary: Stage 1 dependency graph
```
S1.1 Entity metadata parser ──┬── S1.2 Schema validator
                              ├── S1.5 Eval output format ── S1.6 Batch evaluator
                              └── S1.7 FCA computation
S1.3 Embedding adapter ──────── (independent)
S1.4 Graph analysis ─────────── (independent)
```
S1.1, S1.3, and S1.4 can proceed in parallel. S1.6 (batch evaluator) sits
at the end of the longest chain (S1.1 → S1.5 → S1.6) and is the final
piece needed before the Stage 2 evaluation work (S2.3, S2.4) can begin.
---
## Stage 2: Infospace Tooling
The user-facing layer that provides documented primitives for working
with infospaces. Built on top of Stage 1 infrastructure and the existing
`markitect/spaces/` module.
### S2.1 — Infospace model and configuration
Define the `Infospace` as a first-class concept that extends the existing
`InformationSpace` with:
- **Topic declaration**: name, domain, source material reference
- **Discipline bindings**: which external infospaces are applied as lenses
- **Schema registry**: which schemas govern entity structure
- **Competency questions**: what the infospace should be able to answer
- **Viability thresholds**: minimum acceptable metric scores
- **Evaluation state**: latest per-entity and collection scores
Configuration format: an `infospace.yaml` (or section in existing config)
that declares all of the above.
**Location:** new `markitect/infospace/` package
**Depends on:** S1.1, S1.5, existing `markitect/spaces/`
**Deliverable:** `InfospaceConfig`, `InfospaceState` models + loader
### S2.2 — Infospace lifecycle commands
CLI commands for the core lifecycle:
```bash
# Initialise a new infospace
markitect infospace init --topic "Wealth of Nations" \
    --domain "Economics" \
    --discipline vsm-framework
# Show infospace status (entity count, eval state, viability)
markitect infospace status
# List entities with quality summary
markitect infospace entities [--sort-by score|domain|chapter]
# Show viability dashboard
markitect infospace viability
```
These commands read the `infospace.yaml` config and present information
from the metadata index and evaluation results.
**Location:** `markitect/infospace/cli.py` integrated into main CLI
**Depends on:** S2.1
**Deliverable:** CLI commands with help text and tests
### S2.3 — Per-entity evaluation primitives
Prompt templates and CLI commands for evaluating individual entities:
```bash
# Evaluate all entities
markitect infospace evaluate --provider openrouter
# Evaluate entities from a specific chapter
markitect infospace evaluate --chapter book-1-chapter-05 --provider openrouter
# Re-evaluate a single entity
markitect infospace evaluate --entity division-of-labour --provider openrouter
```
Uses the batch evaluator (S1.6) to run the evaluate-entity prompt
template (defined in the infospace's schema directory) against entities.
Writes structured results to `output/evaluations/`.
**Maps to:** INFRA-TASKS #8, #9
**Location:** `markitect/infospace/evaluation.py`
**Depends on:** S1.6, S2.1
**Deliverable:** Per-entity evaluation pipeline + CLI + prompt template
### S2.4 — Collection-level checks
CLI commands for each of the five collection concerns:
```bash
# Run all collection checks
markitect infospace check --provider openrouter
# Run specific checks
markitect infospace check redundancy --provider openrouter
markitect infospace check coverage --provider openrouter
markitect infospace check coherence --provider openrouter
markitect infospace check consistency --provider openrouter
markitect infospace check granularity --provider openrouter
```
Each check uses Stage 1 infrastructure (embeddings, graph analysis, FCA)
and delegates LLM judgment to the platform. Results written to
`output/metrics/` as per-concern reports + unified `metrics.yaml`.
**Maps to:** INFRA-TASKS #14-19
**Location:** `markitect/infospace/checks/` (one module per concern)
**Depends on:** S1.3, S1.4, S1.6, S1.7, S2.1
**Deliverable:** Five check modules + unified orchestrator + CLI
### S2.5 — Metrics history and viability tracking
Track metrics over time. After each evaluation or check run, append a
timestamped snapshot to `metrics-history.yaml`. Provide commands to
review trends:
```bash
# Show metrics history
markitect infospace history
# Compare two snapshots
markitect infospace history diff 2026-02-18 2026-03-01
# Check viability against thresholds
markitect infospace viability
```
Viability is assessed by comparing current metrics to the thresholds
declared in `infospace.yaml`. The report gives a simple pass/fail per
metric alongside the actual value.
**Maps to:** INFRA-TASKS #12
**Location:** `markitect/infospace/history.py`
**Depends on:** S2.4, S1.5
**Deliverable:** History tracking + viability assessment + CLI
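The pass/fail comparison could be sketched as follows. The threshold names and bounds mirror the `viability` block of the S3.1 example configuration; the metric values are invented, and treating a missing metric as a failure is an assumption rather than a stated requirement.

```python
def assess_viability(metrics, thresholds):
    """Compare current metrics to {name: {"min": x, "max": y}} thresholds."""
    results = {}
    for name, bounds in thresholds.items():
        value = metrics.get(name)
        ok = value is not None  # assumption: an unmeasured metric fails
        if ok and "min" in bounds:
            ok = value >= bounds["min"]
        if ok and "max" in bounds:
            ok = value <= bounds["max"]
        results[name] = {"value": value, "pass": ok}
    return results


thresholds = {
    "redundancy_ratio": {"max": 0.05},
    "coverage_ratio": {"min": 0.60},
    "coherence_components": {"max": 1},
}
# Hypothetical current metrics.
metrics = {"redundancy_ratio": 0.02, "coverage_ratio": 0.45, "coherence_components": 1}
report = assess_viability(metrics, thresholds)
viable = all(entry["pass"] for entry in report.values())
```

Keeping the per-metric values in the report matches the idea that viability is a profile of scores, with the single `viable` flag derived from it only as a convenience.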
### S2.6 — Infospace composition model
The mechanism by which one infospace is applied as a discipline to
another. Builds on `markitect/spaces/composability/`:
- **Discipline binding**: declare that infospace A uses infospace B as a
discipline. B's entities become available as mapping targets.
- **Cross-infospace references**: entity in A maps to concept in B using
the same mapping schema and evaluation pipeline.
- **Discipline viability requirement**: B must be viable (meets its own
thresholds) before it can be used as a discipline for A.
- **Cascading evaluation**: when B's entities change, A's mappings that
reference them are flagged for re-evaluation.
```bash
# Bind a discipline to the current infospace
markitect infospace bind-discipline ./path/to/vsm-infospace
# List bound disciplines and their viability
markitect infospace disciplines
# Check for stale mappings after discipline update
markitect infospace check stale-mappings
```
**Location:** `markitect/infospace/composition.py`
**Depends on:** S2.1, existing `markitect/spaces/composability/`
**Deliverable:** Composition model + CLI + documentation
### S2.7 — Documentation: Infospace Primitives Reference
A reference document explaining all primitives, their purpose, and how
they compose. This is the user-facing documentation for the infospace
tooling layer — the equivalent of a framework guide.
**Location:** `docs/infospace-primitives.md` or in-CLI help
**Depends on:** S2.1-S2.6
**Deliverable:** Reference documentation
### Summary: Stage 2 dependency graph
```
S2.1 Model & config ──┬── S2.2 Lifecycle CLI
                      ├── S2.3 Per-entity evaluation
                      ├── S2.4 Collection checks ── S2.5 History & viability
                      └── S2.6 Composition model
S2.7 Documentation (depends on all above)
```
---
## Stage 3: Example Revision
Revisit the Wealth of Nations / VSM example using the new tooling.
The example becomes both a tutorial and a validation of the tooling.
### S3.1 — Migrate example to infospace configuration
Replace the ad-hoc `process_chapters.py` setup with a declarative
`infospace.yaml`:
```yaml
topic:
  name: "The Wealth of Nations"
  domain: "Classical Economics"
  sources: artifacts/sources/
disciplines:
  - name: "Viable System Model"
    path: artifacts/vsm-reference/
schemas:
  entity: schemas/economic-entity-schema-v1.0.md
  mapping: schemas/vsm-mapping-schema-v1.0.md
  analysis: schemas/chapter-analysis-schema-v1.0.md
competency_questions: schemas/competency-questions.md
viability:
  redundancy_ratio: { max: 0.05 }
  coverage_ratio: { min: 0.60 }
  coherence_components: { max: 1 }
  consistency_cycles: { max: 0 }
  granularity_entropy: { min: 1.0 }
  per_entity_mean: { min: 3.5 }
pipeline:
  stages:
    - template: extract-entities
      spaces: [sources, guidelines, vsm-reference, entities]
    - template: map-to-vsm
      spaces: [entities, vsm-reference, guidelines]
    - template: synthesize-analysis
      spaces: [sources, entities, mappings, vsm-reference]
  post_batch:
    - template: assess-metrics
      spaces: [analyses, vsm-reference]
```
**Depends on:** S2.1
**Deliverable:** `infospace.yaml` + migration of `process_chapters.py` to
use infospace tooling APIs
### S3.2 — Clean per-chapter git history
Re-run all processed chapters (and remaining ones) with per-chapter
commits on a clean branch, then replace the current tangled history.
**Maps to:** INFRA-TASKS #4, #7
**Depends on:** S3.1
**Deliverable:** Clean branch with one commit per chapter
### S3.3 — Full evaluation run
Run all per-entity evaluations and collection checks on the completed
infospace. Establish baseline metrics. Demonstrate the viability
dashboard.
**Maps to:** INFRA-TASKS #6
**Depends on:** S2.3, S2.4, S2.5, S3.2
**Deliverable:** Complete evaluation results + viability report
### S3.4 — Rewrite tutorial
Update `TUTORIAL.md` to use infospace tooling commands instead of
raw `process_chapters.py` invocations. The tutorial should walk
through:
1. Initialising an infospace (`markitect infospace init`)
2. Defining schemas and competency questions
3. Processing chapters (pipeline execution)
4. Evaluating entities (`markitect infospace evaluate`)
5. Running collection checks (`markitect infospace check`)
6. Reviewing viability (`markitect infospace viability`)
7. Iterating: refining guidelines, re-processing, re-evaluating
8. Using the infospace as a discipline for a new project
**Depends on:** S3.1-S3.3
**Deliverable:** Revised `TUTORIAL.md`
### S3.5 — Demonstrate composition
Create a minimal second infospace (e.g. a modern supply chain case
study or a different economic text) that binds the Wealth of Nations
infospace as a discipline, demonstrating the composition model from S2.6.
**Depends on:** S2.6, S3.3
**Deliverable:** Second example infospace + composition tutorial section
---
## Task Mapping
Cross-reference between INFRA-TASKS numbers and roadmap stages:
| INFRA-TASK | Description | Stage |
|------------|-------------|-------|
| 1-3 | Infra fixes (resolved) | — |
| 4 | Per-chapter git history | S3.2 |
| 5 | Prompt file side-effects | S1.6 (batch eval avoids this) |
| 6 | Stale metrics | S3.3 |
| 7 | Remaining 28 chapters | S3.2 |
| 8 | Per-concept quality metrics in schema | S2.3 |
| 9 | Evaluate-entity prompt template | S2.3 |
| 10 | Deterministic schema compliance | S1.2 |
| 11 | Structured metrics output | S1.5 |
| 12 | Metrics-over-time tracking | S2.5 |
| 13 | Entity metadata index | S1.1 |
| 14 | Redundancy detection (C1) | S2.4 |
| 15 | Coverage completeness (C2) | S2.4 |
| 16 | Structural coherence (C3) | S2.4 |
| 17 | Definitional consistency (C4) | S2.4 |
| 18 | Granularity balance (C5) | S2.4 |
| 19 | Unified collection evaluation | S2.4 |
---
## Implementation Order
Recommended sequence, accounting for dependencies and value delivery:
**Phase A — Foundation (Stage 1, parallelisable)**
1. S1.1 Entity metadata parser
2. S1.3 Embedding adapter
3. S1.4 Graph analysis utilities
**Phase B — Validation & Output (Stage 1)**
4. S1.2 Schema compliance validator (needs S1.1)
5. S1.5 Structured evaluation output (needs S1.1)
6. S1.7 FCA computation (needs S1.1)
**Phase C — Orchestration (Stage 1 → Stage 2 bridge)**
7. S1.6 Batch LLM evaluation orchestrator (needs S1.5)
**Phase D — Infospace Core (Stage 2)**
8. S2.1 Infospace model and configuration
9. S2.2 Lifecycle commands
10. S2.3 Per-entity evaluation primitives (needs S1.6, S2.1)
**Phase E — Collection Intelligence (Stage 2)**
11. S2.4 Collection-level checks (needs S1.3, S1.4, S1.7, S2.1)
12. S2.5 Metrics history and viability tracking
**Phase F — Composition (Stage 2)**
13. S2.6 Infospace composition model
14. S2.7 Documentation
**Phase G — Example (Stage 3)**
15. S3.1 Migrate example to infospace config
16. S3.2 Clean per-chapter history
17. S3.3 Full evaluation run
18. S3.4 Rewrite tutorial
19. S3.5 Demonstrate composition
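
The recommended sequence can be checked mechanically against the dependencies stated in this plan. The sketch below encodes only the "needs" and "Depends on" pairs written out explicitly above (range notation such as S3.1-S3.3 is expanded); the helper name `order_violations` is an assumption. It uses the standard-library `graphlib` module, which requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# step -> steps it depends on, as stated explicitly in this plan.
DEPENDS_ON = {
    "S1.2": ["S1.1"],
    "S1.5": ["S1.1"],
    "S1.7": ["S1.1"],
    "S1.6": ["S1.5"],
    "S2.3": ["S1.6", "S2.1"],
    "S2.4": ["S1.3", "S1.4", "S1.7", "S2.1"],
    "S3.1": ["S2.1"],
    "S3.2": ["S3.1"],
    "S3.3": ["S2.3", "S2.4", "S2.5", "S3.2"],
    "S3.4": ["S3.1", "S3.2", "S3.3"],
    "S3.5": ["S2.6", "S3.3"],
}

# The recommended sequence, items 1 to 19.
RECOMMENDED = [
    "S1.1", "S1.3", "S1.4", "S1.2", "S1.5", "S1.7", "S1.6",
    "S2.1", "S2.2", "S2.3", "S2.4", "S2.5", "S2.6", "S2.7",
    "S3.1", "S3.2", "S3.3", "S3.4", "S3.5",
]


def order_violations(order, depends_on):
    """Return (step, dependency) pairs where the dependency comes later.

    Hypothetical helper: an empty list means the order respects every
    listed dependency.
    """
    position = {step: index for index, step in enumerate(order)}
    return [
        (step, dep)
        for step, deps in depends_on.items()
        for dep in deps
        if position[dep] > position[step]
    ]


# Consuming static_order() raises graphlib.CycleError if the graph is cyclic.
one_valid_order = list(TopologicalSorter(DEPENDS_ON).static_order())

print(order_violations(RECOMMENDED, DEPENDS_ON))
```

`TopologicalSorter` yields one valid order out of many, so the check compares the hand-written sequence against the dependency pairs instead of expecting an exact match with the sorter's output.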