Free-tier APIs intermittently return invalid JSON or empty responses.
Now any exception in _call_llm retries up to 3 times with a 5s back-off,
rather than failing immediately on non-rate-limit errors.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Chapters with many pre-existing entities were still truncating at 6000 tokens
because the LLM needs space to output the full list of candidates even when
most are skipped.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
map-to-vsm was consistently truncating at 6000 tokens; synthesize-analysis
sometimes truncated at 3000 for chapters with many entities.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- PipelineStage now supports max_tokens to override the 4096 default
- SourcePipeline records provider/model on each entity file as HTML comment
- output/processing-log.yaml tracks tokens, cost, duration, retries, errors
- _call_llm returns (content, metadata) for downstream traceability
- _http.py wraps JSON parse errors with body preview for debugging
- infospace.yaml stages: extract/map=6000 tokens, synthesize=3000 tokens
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- SourcePipeline: retry split_entities stage once when 0 entity delimiters
are found (free-tier models intermittently return short non-formatted
responses); save raw LLM response to <stage>-raw.md alongside prompts
- Return None (pause pipeline) rather than writing empty view file when
no entities found after max retries
- _http.py: wrap json.JSONDecodeError in LLMAPIError with body preview
- extract-entities.md: add explicit H2-heading format example to Output
Format section to prevent models from using inline "Section:" format
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Extend PipelineStage with name, output_dir, output_macro,
split_entities, and macros fields for declarative pipeline config
- Add SourcePipeline class (pipeline.py) using simple @{macro}
substitution — no SQLite dependency, skip-if-exists per stage,
LLM retry on rate limits, git commit per source
- Add `markitect infospace process [GLOB_PATTERN]` CLI command with
--all, --provider, --model, --check-after-each, --no-commit flags
- Update infospace.yaml with output_dir, output_macro, split_entities,
and macros for each pipeline stage in the WoN example
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two root causes of metric fragmentation observed in collection checks:
1. Schema's Economic Domain used free-form examples ("labour economics,
trade theory") which overrode the enum in extraction-rules.md, causing
the LLM to produce multi-domain strings and non-canonical values.
Fix: schema now specifies the exact 7-value enum with descriptions.
2. Source Chapter had no format constraint, producing 9 different formats
for 7 chapters (full titles, mixed Roman/Arabic numerals, asterisks).
Fix: extraction-rules now mandate "Book [Roman], Chapter [n]" exactly.
These fixes are prerequisites for clean reprocessing (S3.2 continuation).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
History module with snapshot creation from check results, metrics file
I/O, auto-append to history after checks, date-based snapshot lookup,
and metric trend extraction. CLI commands: history, history-diff.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Evaluation pipeline builds prompts from entity metadata, delegates
to BatchEvaluator, parses structured LLM responses into ScoreEntry
objects, and writes evaluation files. CLI: 'markitect infospace evaluate'
with --provider, --entity, --chapter filters.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds 'markitect infospace' command group with init (create config),
status (entity count/domains/disciplines), entities (list with sort),
and viability (threshold dashboard with pass/fail).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
InfospaceConfig (topic, disciplines, schemas, competency questions,
viability thresholds, pipeline) with YAML load/save and directory
discovery. InfospaceState aggregates entities, evaluations, and
viability checks for status reporting.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
BatchEvaluator runs evaluation prompts across item batches with
incremental evaluation (skip unchanged via content digest), per-item
error isolation, progress callbacks, and aggregate token usage tracking.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pure-Python FCA implementation: FormalContext (entity × attribute
binary relation with extent/intent/closure), ConceptLattice via
NextClosure algorithm, find_gap_concepts() for structural coverage
gaps, and find_empty_cells() for cross-tabulation analysis.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add data models (ScoreEntry, EntityEvaluation, EvaluationSnapshot,
SnapshotDiff) and I/O utilities for YAML frontmatter evaluation files,
snapshot persistence, history append, and snapshot diffing.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add OpenAI-compatible embedding support (works with both OpenAI and
OpenRouter), file-based embedding cache with content-digest invalidation,
and pure-Python cosine similarity utilities for downstream redundancy
detection.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Deterministic validation of EntityMeta against declarative schemas:
section presence/word counts, heading format, domain enum values.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extract section-tree algorithm from SchemaGenerator into standalone
core/section_tree.py and build markitect/infospace/ package with
EntityMeta dataclass and parse_entity_file/parse_entity_directory.
Foundation for schema compliance, coverage, and granularity metrics.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Conceptual overview of infospaces as structured, evaluable, composable
knowledge collections. Establishes the vocabulary (topic, discipline,
entity, viability), the build cycle (extract, map, evaluate, refine),
the five collection quality concerns, and the composition model
(hierarchical, networked, swarm).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The SQLite artifact database is a derived cache regenerable from
committed files — no LLM calls needed. Added tutorial section
explaining why it is excluded and how to rebuild it after a fresh clone.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
schema-generate now builds content-aware schemas from the document's
section hierarchy instead of counting markdown syntax elements. Detects
key-value tables, data tables, link lists, and mixed content patterns
to produce schemas that reflect the actual document outline.
Old behavior preserved via --mode syntactic. Validator and visualization
tools pinned to syntactic mode for compatibility.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>