---
id: IB-WP-0015
type: workplan
title: "Generic Source Infospace Generator CLI"
domain: markitect
repo: infospace-bench
status: completed
owner: markitect
topic_slug: markitect
created: "2026-05-14"
updated: "2026-05-14"
state_hub_workstream_slug: "ib-wp-0015-generic-source-infospace-generator-cli"
state_hub_workstream_id: "1bf47fb9-fe55-428a-b8da-8e6cc76d4b03"
depends_on_workplans:
- IB-WP-0013
related_workplans:
- IB-WP-0014
---
# IB-WP-0015 - Generic Source Infospace Generator CLI
## Goal
Turn the Wealth/VSM example into a reusable CLI capability for incrementally
building an infospace from an ebook, article, or collection of knowledge.
When this workplan is done, a user should be able to run something like:
```bash
infospace-bench generate from-source ./examples/my-book.epub \
  --slug my-book \
  --name "My Book Infospace" \
  --profile general-knowledge \
  --provider openrouter \
  --model <openrouter-model-id> \
  --apply
```
and get a manifest-backed infospace with normalized sources, generated entity
artifacts, relation artifacts, evaluations, metrics/history, reports, and a
clear resume path.
## Intent
`IB-WP-0013` proved the successor shape on a one-chapter Wealth/VSM pilot. This
workplan generalizes that capability:
- source intake is generic, not Wealth-only
- workflow templates are reusable profiles
- assisted generation can use OpenRouter explicitly
- generation is incremental and resumable
- default tests stay deterministic and never require live provider credentials
The old infrastructure could generate the Adam Smith example with OpenRouter.
The new infrastructure should recover that operational convenience while
preserving the successor design: explicit workflows, auditable provider calls,
stable artifact IDs, and clean repo boundaries.
## Target CLI Shape
Suggested commands:
```bash
infospace-bench generate init <source> --slug <slug> --name <name> --profile <profile>
infospace-bench generate plan <root> --stage all
infospace-bench generate run <root> --provider openrouter --model <model-id> --stage all
infospace-bench generate resume <root> --provider openrouter --model <model-id>
infospace-bench generate status <root>
```
Short-form combined command:
```bash
infospace-bench generate from-source <source> \
  --slug <slug> \
  --name <name> \
  --profile general-knowledge \
  --provider openrouter \
  --model <model-id> \
  --apply
```
Default-safe modes:
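- `--dry-run`: plan only; make no provider calls and write nothing beyond optional plan output
- `--fixture-responses <path>`: use fixture responses from `<path>` instead of a live provider, for deterministic tests and demos
- `--max-chunks <n>`: cap the number of chunks processed, to bound early runs
- `--stage intake|extract|relations|evaluate|metrics|all`: restrict work to one stage, or run all stages
- `--resume`: skip completed chunks and retry failed or stale work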
## Non-Goals
- Do not make live OpenRouter calls in the default test suite.
- Do not store API keys in `infospace.yaml`.
- Do not build a general document conversion product inside this repo.
- Do not hide provider calls behind implicit workflow execution.
- Do not solve remote storage backends here; `IB-WP-0014` owns backend
abstraction.
- Do not require full EPUB/PDF/article extraction perfection in the first pass;
extraction quality should be explicit and testable.
## Tasks
### T01 - Source intake and corpus normalization
```task
id: IB-WP-0015-T01
status: done
priority: high
state_hub_task_id: "08196bf2-9323-4cd8-860c-4306c965ed60"
```
- Add a source intake module that accepts files and folders
- Normalize supported inputs into markdown-ish source artifacts with stable IDs
- First supported source types:
  - Markdown
  - plain text
  - local HTML/article export
  - EPUB or ebook-like directory fixtures
  - folder collections of the above
- Record source metadata: original path, source type, title, digest, chunk ID,
  import time, and extractor version
- Add chunking for long inputs with deterministic chunk IDs
- Add tests for article, ebook, and folder fixtures
- Keep URL fetching optional and explicit; local fixtures must cover tests
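
The deterministic chunking described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the `Chunk` type, the `chunk_source` function, the paragraph-based splitting rule, and the ID format are all assumptions made for the example. The point is that the ID and digest depend only on the source ID, the chunk position, and the chunk text, so repeated imports of unchanged input yield identical chunks.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str  # stable ID built from source ID, position, and content digest
    digest: str    # SHA-256 of the chunk text
    text: str


def chunk_source(source_id: str, text: str, max_chars: int = 2000) -> list[Chunk]:
    """Group blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    groups: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            groups.append(current)
            current = para
        else:
            current = candidate
    if current:
        groups.append(current)

    chunks = []
    for index, body in enumerate(groups):
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        # Hypothetical ID format: <source>-c<position>-<digest prefix>
        chunks.append(Chunk(f"{source_id}-c{index:04d}-{digest[:8]}", digest, body))
    return chunks
```

Because the digest is part of both the record and the ID, an edit to one paragraph changes only the affected chunk, which is what later stale detection would rely on.
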
### T02 - Generic workflow template pack and schema profiles
```task
id: IB-WP-0015-T02
status: done
priority: high
state_hub_task_id: "5604796b-cb09-43ed-b3a9-5d4906790807"
```
- Create reusable profile packs under a clear directory such as
  `profiles/general-knowledge/`
- Include contracts for generated entities, relations, summaries, and
  evaluations
- Include prompt templates for:
  - source/chunk summary
  - entity extraction
  - relation extraction
  - entity evaluation
  - collection synthesis/reporting
- Let profiles define terminology, extraction granularity, evaluation criteria,
  and optional lenses such as VSM
- Preserve the Wealth/VSM pilot as a specialized profile or example derived
  from the generic path
### T03 - OpenRouter provider adapter and model configuration
```task
id: IB-WP-0015-T03
status: done
priority: high
state_hub_task_id: "c02720c5-1b82-458a-bf8c-9147af4fd9e9"
```
- Add an explicit OpenRouter assisted-generation adapter
- Read credentials from environment, preferably `OPENROUTER_API_KEY`
- Accept `--model <openrouter-model-id>` at the CLI boundary
- Record provider, model, request ID if available, timing, token usage if
available, retry count, and error detail in run records
- Add rate-limit and retry behavior that is visible and bounded
- Add model fallback support only when explicitly configured
- Keep fixture adapter support for deterministic tests
- Add provider contract tests with mocked HTTP, not live network calls
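
One way to make retry behavior "visible and bounded" is to wrap the provider call in a loop that records every retry in the run record. The sketch below is an assumption-heavy illustration: `ProviderError`, `RunRecord`, `load_api_key`, and `call_with_retries` are hypothetical names, and the `send` callable stands in for whatever actually performs the HTTP request, which keeps the example free of live network calls.

```python
import os
import time
from dataclasses import dataclass
from typing import Callable, Mapping, Optional


class ProviderError(Exception):
    """Hypothetical error raised by a provider call."""

    def __init__(self, message: str, retryable: bool = False):
        super().__init__(message)
        self.retryable = retryable


@dataclass
class RunRecord:
    provider: str
    model: str
    retry_count: int = 0
    elapsed_seconds: float = 0.0
    error: Optional[str] = None
    response: Optional[dict] = None


def load_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Read the key from the environment; it is never stored in infospace.yaml."""
    env = os.environ if env is None else env
    key = env.get("OPENROUTER_API_KEY", "")
    if not key:
        raise RuntimeError("OPENROUTER_API_KEY is not set")
    return key


def call_with_retries(
    send: Callable[[str], dict],
    model: str,
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> RunRecord:
    """Call send(model) with bounded exponential backoff and record the outcome."""
    record = RunRecord(provider="openrouter", model=model)
    started = time.monotonic()
    for attempt in range(max_retries + 1):
        try:
            record.response = send(model)
            record.error = None
            break
        except ProviderError as exc:
            record.error = str(exc)
            if not exc.retryable or attempt == max_retries:
                break  # give up: non-retryable error or retry budget spent
            record.retry_count += 1
            sleep(base_delay * 2 ** attempt)
    record.elapsed_seconds = time.monotonic() - started
    return record
```

Injecting `send` and `sleep` is a design choice for this sketch: it lets provider contract tests use a mocked transport and run without real delays.
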
### T04 - Generator CLI orchestration
```task
id: IB-WP-0015-T04
status: done
priority: high
state_hub_task_id: "21b50fbc-f43e-4b18-b012-976a5241f52a"
```
- Add `infospace-bench generate ...` subcommands
- `generate init` creates an infospace from a source and selected profile
- `generate plan` shows chunk/stage/provider work without mutation
- `generate run` executes selected stages
- `generate resume` continues incomplete or failed work
- `generate status` reports source chunks, generated artifacts, failures,
stale outputs, evaluations, and metrics
- Support both stepwise and combined `from-source` flows
- Keep CLI output structured JSON by default, consistent with existing commands
- Ensure commands work with current local-folder backend and do not block
`IB-WP-0014`
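
As a rough illustration of the stepwise subcommand layout, the following sketch wires the commands listed above into `argparse` and emits the parsed arguments as JSON. It is not the project's real parser: the assignment of flags to subcommands (for example, placing `--max-chunks` on `run`) is an assumption, and `from-source` is omitted for brevity.

```python
import argparse
import json

STAGES = ["intake", "extract", "relations", "evaluate", "metrics", "all"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="infospace-bench")
    top = parser.add_subparsers(dest="command", required=True)
    generate = top.add_parser("generate")
    sub = generate.add_subparsers(dest="action", required=True)

    init = sub.add_parser("init")
    init.add_argument("source")
    for flag in ("--slug", "--name", "--profile"):
        init.add_argument(flag, required=True)

    plan = sub.add_parser("plan")
    plan.add_argument("root")
    plan.add_argument("--stage", choices=STAGES, default="all")

    run = sub.add_parser("run")
    run.add_argument("root")
    run.add_argument("--provider")
    run.add_argument("--model")
    run.add_argument("--stage", choices=STAGES, default="all")
    run.add_argument("--max-chunks", type=int, default=None)  # assumed placement

    resume = sub.add_parser("resume")
    resume.add_argument("root")
    resume.add_argument("--provider")
    resume.add_argument("--model")

    status = sub.add_parser("status")
    status.add_argument("root")
    return parser


def main(argv: list[str]) -> str:
    """Parse argv and return structured JSON instead of free-form text."""
    args = build_parser().parse_args(argv)
    return json.dumps(vars(args), sort_keys=True)
```

Returning a JSON string from `main` mirrors the requirement that CLI output stay structured by default; a real command would add its results to that payload.
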
### T05 - Incremental resume and stale output handling
```task
id: IB-WP-0015-T05
status: done
priority: high
state_hub_task_id: "ad882b6e-924e-4f9a-8e93-119aeadd8132"
```
- Track a generation state file under `output/workflows/` or an equivalent
successor location
- Record chunk digest, stage status, output artifact IDs, provider metadata,
errors, and timestamps
- Skip unchanged completed chunks by default
- Detect stale generated artifacts when source digests or profile/template
digests change
- Support rerun policies:
  - failed only
  - stale only
  - force all
  - selected chunk
- Add tests for interrupted generation, resume, stale detection, and idempotent
manifest updates
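
The interaction between stale detection and rerun policies can be sketched as a pure selection function over the generation state. Everything named here (`RerunPolicy`, `ChunkState`, `is_stale`, `select_chunks`, and the status strings) is hypothetical; the sketch assumes a chunk is stale when it completed against a source or profile/template digest that no longer matches the current one.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RerunPolicy(Enum):
    DEFAULT = "default"        # run failed, stale, and never-run chunks
    FAILED_ONLY = "failed-only"
    STALE_ONLY = "stale-only"
    FORCE_ALL = "force-all"
    SELECTED = "selected"


@dataclass
class ChunkState:
    chunk_id: str
    status: str          # assumed values: "pending", "done", "failed"
    source_digest: str   # chunk digest recorded at the last run
    profile_digest: str  # profile/template digest recorded at the last run


def is_stale(state: ChunkState, source_digest: str, profile_digest: str) -> bool:
    """A completed chunk is stale if either recorded digest no longer matches."""
    return state.status == "done" and (
        state.source_digest != source_digest
        or state.profile_digest != profile_digest
    )


def select_chunks(
    states: list[ChunkState],
    current_sources: dict[str, str],
    profile_digest: str,
    policy: RerunPolicy = RerunPolicy.DEFAULT,
    selected: Optional[str] = None,
) -> list[str]:
    """Return the chunk IDs that should run under the given policy."""
    picked = []
    for state in states:
        stale = is_stale(state, current_sources[state.chunk_id], profile_digest)
        failed = state.status == "failed"
        if policy is RerunPolicy.FORCE_ALL:
            run = True
        elif policy is RerunPolicy.FAILED_ONLY:
            run = failed
        elif policy is RerunPolicy.STALE_ONLY:
            run = stale
        elif policy is RerunPolicy.SELECTED:
            run = state.chunk_id == selected
        else:
            run = failed or stale or state.status == "pending"
        if run:
            picked.append(state.chunk_id)
    return picked
```

Under the default policy an unchanged, completed chunk is never selected, which is the property that makes a repeated run idempotent.
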
### T06 - End-to-end examples, docs, and acceptance suite
```task
id: IB-WP-0015-T06
status: done
priority: medium
state_hub_task_id: "3461eacf-e42a-455c-954c-849b0ad69fc1"
```
- Add deterministic end-to-end fixtures:
  - one article
  - one small ebook-like fixture
  - one folder collection
- Prove each can generate an infospace with fixture responses
- Add an optional live OpenRouter smoke path that is skipped unless explicitly
  enabled
- Document:
  - how to choose a model
  - where to put credentials
  - how to cap chunks/cost
  - how to resume
  - how to review generated artifacts
  - how to move from a generic profile to a specialized profile
- Update README and replacement docs with the new generator path
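
An opt-in live smoke path might look like the sketch below, which skips the test unless both an explicit flag and `OPENROUTER_API_KEY` are present. The flag name `INFOSPACE_BENCH_LIVE`, the `OPENROUTER_MODEL` variable, and the fixture path are hypothetical placeholders; the CLI flags are the ones listed in this workplan, and `--max-chunks 1` keeps the cost of a live run small.

```python
import json
import os
import subprocess
import unittest
from typing import Mapping


def live_tests_enabled(env: Mapping[str, str]) -> bool:
    """Require both an explicit opt-in flag and a configured API key."""
    opted_in = env.get("INFOSPACE_BENCH_LIVE") == "1"  # hypothetical flag name
    return opted_in and bool(env.get("OPENROUTER_API_KEY"))


class LiveOpenRouterSmoke(unittest.TestCase):
    @unittest.skipUnless(live_tests_enabled(os.environ), "live smoke test not enabled")
    def test_one_chunk_from_article(self):
        result = subprocess.run(
            [
                "infospace-bench", "generate", "from-source",
                "tests/fixtures/article.md",  # hypothetical fixture path
                "--slug", "live-smoke",
                "--name", "Live Smoke",
                "--profile", "general-knowledge",
                "--provider", "openrouter",
                "--model", os.environ["OPENROUTER_MODEL"],  # hypothetical variable
                "--max-chunks", "1",
                "--apply",
            ],
            capture_output=True,
            text=True,
            check=True,
        )
        # CLI output is structured JSON by default, so it should parse cleanly.
        json.loads(result.stdout)
```

With neither variable set, the default test run reports this case as skipped and makes no network call.
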
## Acceptance
- A user can generate a new infospace from a local article fixture using only
deterministic fixture responses
- A user can generate a new infospace from an ebook-like fixture using only
deterministic fixture responses
- A user can generate a new infospace from a folder collection using only
deterministic fixture responses
- A user can run the same CLI with `--provider openrouter --model <model-id>`
when `OPENROUTER_API_KEY` is configured
- Generated sources, chunks, entities, relations, evaluations, metrics, history,
and reports are manifest-backed and inspectable
- Generation is resumable and idempotent for unchanged inputs
- Stale outputs are detected when source or profile/template inputs change
- Live provider calls are explicit, auditable, and absent from default tests
## Relationship To Existing Work
- Builds on `IB-WP-0013`, which proved the explicit workflow shape for the
Wealth/VSM one-chapter pilot.
- Should stay compatible with `IB-WP-0014`, but should not wait for remote
backend support.
- Continues the successor split:
  - `markitect-tool`: markdown parsing, templates, contracts
  - `infospace-bench`: applied infospace generation workflow and CLI
  - `kontextual-engine`: durable runtime/retrieval/audit if needed later
## Implementation Notes
Completed on 2026-05-14.
- Added generic source intake for Markdown, plain text, local HTML, EPUB-like
archives, and folder collections.
- Added the `general-knowledge` profile with prompt templates and contracts.
- Added an explicit OpenRouter assisted-generation adapter with mocked provider
tests and environment-based credential lookup.
- Added `infospace-bench generate` subcommands for init, plan, run, resume,
status, and from-source flows.
- Added generation state, resume skipping, source/profile stale detection,
metrics/history recording, and a manifest-backed generation report.
- Added deterministic acceptance tests for article, ebook-like, and folder
generation using fixture responses.