diff --git a/workplans/IB-WP-0015-generic-source-infospace-generator-cli.md b/workplans/IB-WP-0015-generic-source-infospace-generator-cli.md
new file mode 100644
index 0000000..04c9ee9
--- /dev/null
+++ b/workplans/IB-WP-0015-generic-source-infospace-generator-cli.md
@@ -0,0 +1,266 @@
+---
+id: IB-WP-0015
+type: workplan
+title: "Generic Source Infospace Generator CLI"
+domain: markitect
+repo: infospace-bench
+status: planned
+owner: markitect
+topic_slug: markitect
+created: "2026-05-14"
+updated: "2026-05-14"
+state_hub_workstream_slug: "ib-wp-0015-generic-source-infospace-generator-cli"
+state_hub_workstream_id: "1bf47fb9-fe55-428a-b8da-8e6cc76d4b03"
+depends_on_workplans:
+  - IB-WP-0013
+related_workplans:
+  - IB-WP-0014
+---
+
+# IB-WP-0015 - Generic Source Infospace Generator CLI
+
+## Goal
+
+Turn the Wealth/VSM example into a reusable CLI capability for incrementally
+building an infospace from an ebook, article, or collection of knowledge.
+
+When this workplan is done, a user should be able to run something like:
+
+```bash
+infospace-bench generate from-source ./examples/my-book.epub \
+  --slug my-book \
+  --name "My Book Infospace" \
+  --profile general-knowledge \
+  --provider openrouter \
+  --model \
+  --apply
+```
+
+and get a manifest-backed infospace with normalized sources, generated entity
+artifacts, relation artifacts, evaluations, metrics/history, reports, and a
+clear resume path.
+
+## Intent
+
+`IB-WP-0013` proved the successor shape on a one-chapter Wealth/VSM pilot. This
+workplan generalizes that capability:
+
+- source intake is generic, not Wealth-only
+- workflow templates are reusable profiles
+- assisted generation can use OpenRouter explicitly
+- generation is incremental and resumable
+- default tests stay deterministic and never require live provider credentials
+
+The old infrastructure could generate the Adam Smith example with OpenRouter.
+The new infrastructure should recover that operational convenience while
+preserving the successor design: explicit workflows, auditable provider calls,
+stable artifact IDs, and clean repo boundaries.
+
+## Target CLI Shape
+
+Suggested commands:
+
+```bash
+infospace-bench generate init --slug --name --profile
+infospace-bench generate plan --stage all
+infospace-bench generate run --provider openrouter --model --stage all
+infospace-bench generate resume --provider openrouter --model
+infospace-bench generate status
+```
+
+Short-form combined command:
+
+```bash
+infospace-bench generate from-source \
+  --slug \
+  --name \
+  --profile general-knowledge \
+  --provider openrouter \
+  --model \
+  --apply
+```
+
+Default-safe modes:
+
+- `--dry-run`: plan without provider calls or writes beyond optional plan output
+- `--fixture-responses `: deterministic tests and demos
+- `--max-chunks `: bound early runs
+- `--stage intake|extract|relations|evaluate|metrics|all`
+- `--resume`: skip completed chunks and retry failed or stale work
+
+## Non-Goals
+
+- Do not make live OpenRouter calls in the default test suite.
+- Do not store API keys in `infospace.yaml`.
+- Do not build a general document conversion product inside this repo.
+- Do not hide provider calls behind implicit workflow execution.
+- Do not solve remote storage backends here; `IB-WP-0014` owns backend
+  abstraction.
+- Do not require full EPUB/PDF/article extraction perfection in the first pass;
+  extraction quality should be explicit and testable.
+
+## Tasks
+
+### T01 - Source intake and corpus normalization
+
+```task
+id: IB-WP-0015-T01
+status: todo
+priority: high
+state_hub_task_id: "08196bf2-9323-4cd8-860c-4306c965ed60"
+```
+
+- Add a source intake module that accepts files and folders
+- Normalize supported inputs into markdown-ish source artifacts with stable IDs
+- First supported source types:
+  - Markdown
+  - plain text
+  - local HTML/article export
+  - EPUB or ebook-like directory fixtures
+  - folder collections of the above
+- Record source metadata: original path, source type, title, digest, chunk ID,
+  import time, and extractor version
+- Add chunking for long inputs with deterministic chunk IDs
+- Add tests for article, ebook, and folder fixtures
+- Keep URL fetching optional and explicit; local fixtures must cover tests
+
+### T02 - Generic workflow template pack and schema profiles
+
+```task
+id: IB-WP-0015-T02
+status: todo
+priority: high
+state_hub_task_id: "5604796b-cb09-43ed-b3a9-5d4906790807"
+```
+
+- Create reusable profile packs under a clear directory such as
+  `profiles/general-knowledge/`
+- Include contracts for generated entities, relations, summaries, and
+  evaluations
+- Include prompt templates for:
+  - source/chunk summary
+  - entity extraction
+  - relation extraction
+  - entity evaluation
+  - collection synthesis/reporting
+- Let profiles define terminology, extraction granularity, evaluation criteria,
+  and optional lenses such as VSM
+- Preserve the Wealth/VSM pilot as a specialized profile or example derived
+  from the generic path
+
+### T03 - OpenRouter provider adapter and model configuration
+
+```task
+id: IB-WP-0015-T03
+status: todo
+priority: high
+state_hub_task_id: "c02720c5-1b82-458a-bf8c-9147af4fd9e9"
+```
+
+- Add an explicit OpenRouter assisted-generation adapter
+- Read credentials from environment, preferably `OPENROUTER_API_KEY`
+- Accept `--model ` at the CLI boundary
+- Record provider, model, request ID if available, timing, token usage if
+  available, retry count, and error detail in run records
+- Add rate-limit and retry behavior that is visible and bounded
+- Add model fallback support only when explicitly configured
+- Keep fixture adapter support for deterministic tests
+- Add provider contract tests with mocked HTTP, not live network calls
+
+### T04 - Generator CLI orchestration
+
+```task
+id: IB-WP-0015-T04
+status: todo
+priority: high
+state_hub_task_id: "21b50fbc-f43e-4b18-b012-976a5241f52a"
+```
+
+- Add `infospace-bench generate ...` subcommands
+- `generate init` creates an infospace from a source and selected profile
+- `generate plan` shows chunk/stage/provider work without mutation
+- `generate run` executes selected stages
+- `generate resume` continues incomplete or failed work
+- `generate status` reports source chunks, generated artifacts, failures,
+  stale outputs, evaluations, and metrics
+- Support both stepwise and combined `from-source` flows
+- Keep CLI output structured JSON by default, consistent with existing commands
+- Ensure commands work with current local-folder backend and do not block
+  `IB-WP-0014`
+
+### T05 - Incremental resume and stale output handling
+
+```task
+id: IB-WP-0015-T05
+status: todo
+priority: high
+state_hub_task_id: "ad882b6e-924e-4f9a-8e93-119aeadd8132"
+```
+
+- Track a generation state file under `output/workflows/` or an equivalent
+  successor location
+- Record chunk digest, stage status, output artifact IDs, provider metadata,
+  errors, and timestamps
+- Skip unchanged completed chunks by default
+- Detect stale generated artifacts when source digests or profile/template
+  digests change
+- Support rerun policies:
+  - failed only
+  - stale only
+  - force all
+  - selected chunk
+- Add tests for interrupted generation, resume, stale detection, and idempotent
+  manifest updates
+
+### T06 - End-to-end examples, docs, and acceptance suite
+
+```task
+id: IB-WP-0015-T06
+status: todo
+priority: medium
+state_hub_task_id: "3461eacf-e42a-455c-954c-849b0ad69fc1"
+```
+
+- Add deterministic end-to-end fixtures:
+  - one article
+  - one small ebook-like fixture
+  - one folder collection
+- Prove each can generate an infospace with fixture responses
+- Add an optional live OpenRouter smoke path that is skipped unless explicitly
+  enabled
+- Document:
+  - how to choose a model
+  - where to put credentials
+  - how to cap chunks/cost
+  - how to resume
+  - how to review generated artifacts
+  - how to move from a generic profile to a specialized profile
+- Update README and replacement docs with the new generator path
+
+## Acceptance
+
+- A user can generate a new infospace from a local article fixture using only
+  deterministic fixture responses
+- A user can generate a new infospace from an ebook-like fixture using only
+  deterministic fixture responses
+- A user can generate a new infospace from a folder collection using only
+  deterministic fixture responses
+- A user can run the same CLI with `--provider openrouter --model `
+  when `OPENROUTER_API_KEY` is configured
+- Generated sources, chunks, entities, relations, evaluations, metrics, history,
+  and reports are manifest-backed and inspectable
+- Generation is resumable and idempotent for unchanged inputs
+- Stale outputs are detected when source or profile/template inputs change
+- Live provider calls are explicit, auditable, and absent from default tests
+
+## Relationship To Existing Work
+
+- Builds on `IB-WP-0013`, which proved the explicit workflow shape for the
+  Wealth/VSM one-chapter pilot.
+- Should stay compatible with `IB-WP-0014`, but should not wait for remote
+  backend support.
+- Continues the successor split:
+  - `markitect-tool`: markdown parsing, templates, contracts
+  - `infospace-bench`: applied infospace generation workflow and CLI
+  - `kontextual-engine`: durable runtime/retrieval/audit if needed later
+