diff --git a/.gitignore b/.gitignore
index ee556450..104d4a82 100644
--- a/.gitignore
+++ b/.gitignore
@@ -78,6 +78,7 @@ Thumbs.db
 
 # MarkiTect database files (local development)
 markitect.db
+**/infospace.db
 assets/assets.db
 **/assets.db
 .markitect/
diff --git a/examples/infospace-with-history/TUTORIAL.md b/examples/infospace-with-history/TUTORIAL.md
index 53223b12..1c63fe6c 100644
--- a/examples/infospace-with-history/TUTORIAL.md
+++ b/examples/infospace-with-history/TUTORIAL.md
@@ -43,6 +43,7 @@ examples/infospace-with-history/
 ├── TUTORIAL.md # This file
 ├── INFRA-TASKS.md # Infrastructure issues found during the experiment
 ├── process_chapters.py # Pipeline script
+├── infospace.db # SQLite artifact database (generated, not in git)
 │
 ├── schemas/ # Output structure definitions
 │   ├── economic-entity-schema-v1.0.md
@@ -369,7 +370,53 @@ python process_chapters.py --stats
 
 ---
 
-## 7. How the LLM Integration Works
+## 7. The Artifact Database (`infospace.db`)
+
+The pipeline stores all artifacts (source text, templates, guidelines, generated
+outputs) and their dependency edges in a local SQLite database —
+`infospace.db`. This file is **not checked into git** because it is a derived
+cache that can be regenerated deterministically from the files already in the
+repository.
+
+### Why it is excluded
+
+- **Binary format** — SQLite databases don't produce meaningful diffs and
+  would bloat the git history with every pipeline run.
+- **Fully derived** — every piece of data in the database originates from
+  markdown files that *are* tracked in git (sources, templates, schemas,
+  guidelines, and generated output).
+- **Reproducible** — re-running the pipeline rebuilds the database from
+  scratch without any LLM calls, because each stage checks for existing
+  output files on disk before invoking the LLM.
+
+### How to regenerate it
+
+If `infospace.db` is missing (e.g. after a fresh clone), rebuild it by
+re-running the pipeline over the chapters that already have output on disk:
+
+```bash
+# Regenerate the database from existing output files (no LLM calls needed):
+python process_chapters.py --all --no-commit
+```
+
+This will:
+
+1. Create a fresh `infospace.db`
+2. Load all static artifacts (templates, guidelines, VSM reference)
+3. For each chapter whose output files already exist, import them into the
+   database and record dependency edges
+4. Skip LLM calls entirely — existing files are detected and reused
+
+After regeneration, `--list` and `--stats` work as normal:
+
+```bash
+python process_chapters.py --list
+python process_chapters.py --stats
+```
+
+---
+
+## 8. How the LLM Integration Works
 
 The pipeline uses MarkiTect's `markitect.llm` module, which provides three
 adapter backends that implement the `LLMAdapter` interface:
@@ -423,7 +470,7 @@ supports `gemini-2.5-flash` with generous rate limits.
 
 ---
 
-## 8. Tracking History with Git
+## 9. Tracking History with Git
 
 Every processed chapter produces a git commit containing:
 
@@ -459,7 +506,7 @@ git commit -m "infospace: process book-1-chapter-05"
 
 ---
 
-## 9. Cost and Performance
+## 10. Cost and Performance
 
 From our measurements processing chapters 3-5:
 
@@ -486,7 +533,7 @@ To reduce costs further, use a cheaper model:
 
 ---
 
-## 10. Completing the Remaining Chapters
+## 11. Completing the Remaining Chapters
 
 As of now, 5 of 35 chapters are processed (Book I, Chapters 1-5). Here is how to
 complete the rest.
@@ -555,7 +602,7 @@ fill the remaining gaps in S3*, S5, and regulatory concepts.
 
 ---
 
-## 11. Quality Improvement Loop
+## 12. Quality Improvement Loop
 
 The infospace is designed to be **iteratively refined**:
 
@@ -604,7 +651,7 @@ history of the infospace — every refinement decision is traceable.
 
 ---
 
-## 12. Infrastructure Issues Found and Fixed
+## 13. Infrastructure Issues Found and Fixed
 
 During development we documented three issues with the MarkiTect infrastructure
 in `INFRA-TASKS.md`:
@@ -624,7 +671,7 @@ See `INFRA-TASKS.md` for details on each fix.
 
 ---
 
-## 13. Adapting This Pattern to Your Own Project
+## 14. Adapting This Pattern to Your Own Project
 
 To build your own infospace using this pattern:
 