infospace-bench/workplans/IB-WP-0015-generic-source-infospace-generator-cli.md

id: IB-WP-0015
type: workplan
title: Generic Source Infospace Generator CLI
domain: markitect
repo: infospace-bench
status: completed
owner: markitect
topic_slug: markitect
created: 2026-05-14
updated: 2026-05-14
state_hub_workstream_slug: ib-wp-0015-generic-source-infospace-generator-cli
state_hub_workstream_id: 1bf47fb9-fe55-428a-b8da-8e6cc76d4b03
depends_on_workplans: IB-WP-0013
related_workplans: IB-WP-0014

IB-WP-0015 - Generic Source Infospace Generator CLI

Goal

Turn the Wealth/VSM example into a reusable CLI capability for incrementally building an infospace from an ebook, article, or collection of knowledge.

When this workplan is done, a user should be able to run something like:

infospace-bench generate from-source ./examples/my-book.epub \
  --slug my-book \
  --name "My Book Infospace" \
  --profile general-knowledge \
  --provider openrouter \
  --model <openrouter-model-id> \
  --apply

and get a manifest-backed infospace with normalized sources, generated entity artifacts, relation artifacts, evaluations, metrics/history, reports, and a clear resume path.

Intent

IB-WP-0013 proved the successor shape on a one-chapter Wealth/VSM pilot. This workplan generalizes that capability:

  • source intake is generic, not Wealth-only
  • workflow templates are reusable profiles
  • assisted generation can use OpenRouter explicitly
  • generation is incremental and resumable
  • default tests stay deterministic and never require live provider credentials

The old infrastructure could generate the Adam Smith example with OpenRouter. The new infrastructure should recover that operational convenience while preserving the successor design: explicit workflows, auditable provider calls, stable artifact IDs, and clean repo boundaries.

Target CLI Shape

Suggested commands:

infospace-bench generate init <source> --slug <slug> --name <name> --profile <profile>
infospace-bench generate plan <root> --stage all
infospace-bench generate run <root> --provider openrouter --model <model-id> --stage all
infospace-bench generate resume <root> --provider openrouter --model <model-id>
infospace-bench generate status <root>

Short-form combined command:

infospace-bench generate from-source <source> \
  --slug <slug> \
  --name <name> \
  --profile general-knowledge \
  --provider openrouter \
  --model <model-id> \
  --apply

Default-safe modes:

  • --dry-run: plan only; make no provider calls and write nothing except optional plan output
  • --fixture-responses <path>: deterministic tests and demos
  • --max-chunks <n>: bound early runs
  • --stage intake|extract|relations|evaluate|metrics|all: select which stage to run
  • --resume: skip completed chunks and retry failed or stale work
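
The flags above could be wired up roughly as follows. This is a minimal argparse sketch, not the repo's actual parser: the flag names come from this workplan, while `build_parser`, `is_mutating`, the mutual exclusion of --dry-run and --apply, and the rule that nothing is written without --apply are assumptions made for illustration.

```python
import argparse

STAGES = ["intake", "extract", "relations", "evaluate", "metrics", "all"]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser for `infospace-bench generate from-source`."""
    parser = argparse.ArgumentParser(prog="infospace-bench generate from-source")
    parser.add_argument("source", help="file or folder to ingest")
    parser.add_argument("--slug", required=True)
    parser.add_argument("--name", required=True)
    parser.add_argument("--profile", required=True)
    parser.add_argument("--provider", default=None)
    parser.add_argument("--model", default=None)
    parser.add_argument("--stage", choices=STAGES, default="all")
    parser.add_argument("--max-chunks", type=int, default=None)
    parser.add_argument("--fixture-responses", default=None)
    parser.add_argument("--resume", action="store_true")
    # Assumption: --dry-run and --apply cannot be combined.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--dry-run", action="store_true")
    mode.add_argument("--apply", action="store_true")
    return parser


def is_mutating(args: argparse.Namespace) -> bool:
    """Assumed default-safe rule: only write when --apply is given."""
    return bool(args.apply and not args.dry_run)
```

With this shape, a run that omits --apply parses cleanly but is treated as non-mutating, which keeps early experiments bounded by --max-chunks and free of side effects.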

Non-Goals

  • Do not make live OpenRouter calls in the default test suite.
  • Do not store API keys in infospace.yaml.
  • Do not build a general document conversion product inside this repo.
  • Do not hide provider calls behind implicit workflow execution.
  • Do not solve remote storage backends here; IB-WP-0014 owns backend abstraction.
  • Do not require perfect EPUB/PDF/article extraction in the first pass; extraction quality should be explicit and testable.

Tasks

T01 - Source intake and corpus normalization

id: IB-WP-0015-T01
status: done
priority: high
state_hub_task_id: "08196bf2-9323-4cd8-860c-4306c965ed60"
  • Add a source intake module that accepts files and folders
  • Normalize supported inputs into markdown-ish source artifacts with stable IDs
  • First supported source types:
    • Markdown
    • plain text
    • local HTML/article export
    • EPUB or ebook-like directory fixtures
    • folder collections of the above
  • Record source metadata: original path, source type, title, digest, chunk ID, import time, and extractor version
  • Add chunking for long inputs with deterministic chunk IDs
  • Add tests for article, ebook, and folder fixtures
  • Keep URL fetching optional and explicit; local fixtures must cover tests
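
Deterministic chunking with recorded digests might look like the sketch below. It is an illustration under stated assumptions rather than the shipped intake module: `chunk_source`, the paragraph-packing strategy, the `max_chars` default, and the `<source_id>-cNNNN` ID scheme are all hypothetical. IDs are position-based so that a changed chunk keeps its ID while its digest changes.

```python
import hashlib


def digest(text: str) -> str:
    """SHA-256 hex digest of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def chunk_source(source_id: str, text: str, max_chars: int = 2000) -> list:
    """Pack blank-line-separated paragraphs into chunks of up to max_chars.

    A single paragraph longer than max_chars is kept whole as one
    oversized chunk rather than being split mid-paragraph.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    bodies, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            bodies.append(current)
            current = para
        else:
            current = candidate
    if current:
        bodies.append(current)
    return [
        {
            "chunk_id": f"{source_id}-c{index:04d}",  # hypothetical ID scheme
            "index": index,
            "digest": digest(body),
            "text": body,
        }
        for index, body in enumerate(bodies)
    ]
```

Running the function twice on the same input yields identical IDs and digests, which is the property the fixture tests would need to assert.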

T02 - Generic workflow template pack and schema profiles

id: IB-WP-0015-T02
status: done
priority: high
state_hub_task_id: "5604796b-cb09-43ed-b3a9-5d4906790807"
  • Create reusable profile packs under a clear directory such as profiles/general-knowledge/
  • Include contracts for generated entities, relations, summaries, and evaluations
  • Include prompt templates for:
    • source/chunk summary
    • entity extraction
    • relation extraction
    • entity evaluation
    • collection synthesis/reporting
  • Let profiles define terminology, extraction granularity, evaluation criteria, and optional lenses such as VSM
  • Preserve the Wealth/VSM pilot as a specialized profile or example derived from the generic path
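
Because stale detection later depends on profile/template digests, a profile pack needs a reproducible digest. The sketch below shows one way to compute it; `profile_digest` and the idea of hashing every file under the profile directory are assumptions, not the documented behavior of the repo.

```python
import hashlib
from pathlib import Path


def profile_digest(profile_dir: Path) -> str:
    """Hash the relative path and content of every file in a profile pack.

    Files are visited in sorted order so the result does not depend on
    filesystem enumeration order.
    """
    hasher = hashlib.sha256()
    for path in sorted(p for p in profile_dir.rglob("*") if p.is_file()):
        hasher.update(path.relative_to(profile_dir).as_posix().encode("utf-8"))
        hasher.update(b"\0")
        hasher.update(path.read_bytes())
        hasher.update(b"\0")
    return hasher.hexdigest()
```

Editing any prompt template or contract under the profile directory changes the digest, which gives downstream stages a single value to compare against what was recorded at generation time.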

T03 - OpenRouter provider adapter and model configuration

id: IB-WP-0015-T03
status: done
priority: high
state_hub_task_id: "c02720c5-1b82-458a-bf8c-9147af4fd9e9"
  • Add an explicit OpenRouter assisted-generation adapter
  • Read credentials from the environment, preferably OPENROUTER_API_KEY
  • Accept --model <openrouter-model-id> at the CLI boundary
  • Record provider, model, request ID if available, timing, token usage if available, retry count, and error detail in run records
  • Add rate-limit and retry behavior that is visible and bounded
  • Add model fallback support only when explicitly configured
  • Keep fixture adapter support for deterministic tests
  • Add provider contract tests with mocked HTTP, not live network calls
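
The credential lookup and the visible, bounded retry behavior can be sketched without any network code. Everything named here is hypothetical (`RunRecord`, `RateLimited`, `load_api_key`, `call_with_retries`, the exponential backoff schedule); the HTTP transport is injected as `send`, so a mocked transport can stand in for OpenRouter in contract tests.

```python
import os
import time
from dataclasses import dataclass, field
from typing import Callable, Mapping, Optional


class RateLimited(Exception):
    """Raised by the transport when the provider signals a rate limit."""


@dataclass
class RunRecord:
    provider: str
    model: str
    retry_count: int = 0
    elapsed_seconds: float = 0.0
    errors: list = field(default_factory=list)
    response: Optional[dict] = None


def load_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Read the key from the environment only, never from infospace.yaml."""
    env = os.environ if env is None else env
    key = env.get("OPENROUTER_API_KEY", "")
    if not key:
        raise RuntimeError("OPENROUTER_API_KEY is not set")
    return key


def call_with_retries(
    send: Callable[[dict], dict],
    payload: dict,
    *,
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> RunRecord:
    """Call send(payload), retrying rate-limit errors at most max_retries times."""
    record = RunRecord(provider="openrouter", model=payload["model"])
    started = time.monotonic()
    for attempt in range(max_retries + 1):
        try:
            record.response = send(payload)
            break
        except RateLimited as exc:
            record.errors.append(str(exc))
            if attempt == max_retries:
                break  # retries exhausted; response stays None
            record.retry_count += 1
            sleep(base_delay * 2 ** attempt)  # assumed backoff: 1s, 2s, 4s, ...
    record.elapsed_seconds = time.monotonic() - started
    return record
```

Passing `sleep` and `send` as parameters keeps the test suite deterministic: a fake transport can raise `RateLimited` a fixed number of times and the test can assert on the recorded retry count and error detail.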

T04 - Generator CLI orchestration

id: IB-WP-0015-T04
status: done
priority: high
state_hub_task_id: "21b50fbc-f43e-4b18-b012-976a5241f52a"
  • Add infospace-bench generate ... subcommands
  • generate init creates an infospace from a source and selected profile
  • generate plan shows chunk/stage/provider work without mutation
  • generate run executes selected stages
  • generate resume continues incomplete or failed work
  • generate status reports source chunks, generated artifacts, failures, stale outputs, evaluations, and metrics
  • Support both stepwise and combined from-source flows
  • Keep CLI output structured JSON by default, consistent with existing commands
  • Ensure commands work with the current local-folder backend and do not block IB-WP-0014
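
As an illustration of structured JSON output, a status summary could be derived from per-chunk state as sketched below. The state shape (a mapping from chunk ID to an entry with a `status` field) and the report keys are assumptions for this example, not the actual output schema of generate status.

```python
import json
from collections import Counter


def status_report(state: dict) -> str:
    """Summarize per-chunk state as a JSON document with stable key order."""
    counts = Counter(entry.get("status", "pending") for entry in state.values())
    report = {
        "chunks": len(state),
        "by_status": dict(counts),
        "failures": sorted(
            chunk_id
            for chunk_id, entry in state.items()
            if entry.get("status") == "failed"
        ),
    }
    return json.dumps(report, indent=2, sort_keys=True)
```

Sorting keys and failure IDs keeps the output byte-for-byte reproducible for identical state, which makes it straightforward to compare against fixtures.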

T05 - Incremental resume and stale output handling

id: IB-WP-0015-T05
status: done
priority: high
state_hub_task_id: "ad882b6e-924e-4f9a-8e93-119aeadd8132"
  • Track a generation state file under output/workflows/ or an equivalent successor location
  • Record chunk digest, stage status, output artifact IDs, provider metadata, errors, and timestamps
  • Skip unchanged completed chunks by default
  • Detect stale generated artifacts when source digests or profile/template digests change
  • Support rerun policies:
    • failed only
    • stale only
    • force all
    • selected chunk
  • Add tests for interrupted generation, resume, stale detection, and idempotent manifest updates
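
The rerun policies and digest-based stale detection can be expressed as a single selection function. This is a sketch under assumptions: `RerunPolicy`, `select_chunks`, the state entry fields (`status`, `chunk_digest`, `profile_digest`), and the default policy of running never-run, failed, and stale chunks are all hypothetical names and choices.

```python
from enum import Enum


class RerunPolicy(Enum):
    DEFAULT = "default"      # assumed: never-run, failed, or stale chunks
    FAILED_ONLY = "failed"
    STALE_ONLY = "stale"
    FORCE_ALL = "force"


def is_stale(entry: dict, chunk_digest: str, profile_digest: str) -> bool:
    """An output is stale when the source or profile digest it was built from changed."""
    return (
        entry.get("chunk_digest") != chunk_digest
        or entry.get("profile_digest") != profile_digest
    )


def select_chunks(chunks, state, profile_digest,
                  policy=RerunPolicy.DEFAULT, only_chunk=None):
    """Return chunk IDs to run.

    chunks maps chunk ID to current digest; state maps chunk ID to the
    entry recorded by the last run. only_chunk selects one chunk
    regardless of policy.
    """
    if only_chunk is not None:
        return [only_chunk] if only_chunk in chunks else []
    selected = []
    for chunk_id, chunk_digest in chunks.items():
        entry = state.get(chunk_id)
        never_run = entry is None
        failed = not never_run and entry.get("status") == "failed"
        stale = not never_run and is_stale(entry, chunk_digest, profile_digest)
        if policy is RerunPolicy.FORCE_ALL:
            run = True
        elif policy is RerunPolicy.FAILED_ONLY:
            run = failed
        elif policy is RerunPolicy.STALE_ONLY:
            run = stale
        else:
            run = never_run or failed or stale
        if run:
            selected.append(chunk_id)
    return selected
```

Under the default policy, a completed chunk whose digests match the recorded ones is skipped, so repeating a run on unchanged inputs selects nothing and leaves the manifest untouched.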

T06 - End-to-end examples, docs, and acceptance suite

id: IB-WP-0015-T06
status: done
priority: medium
state_hub_task_id: "3461eacf-e42a-455c-954c-849b0ad69fc1"
  • Add deterministic end-to-end fixtures:
    • one article
    • one small ebook-like fixture
    • one folder collection
  • Prove each can generate an infospace with fixture responses
  • Add an optional live OpenRouter smoke path that is skipped unless explicitly enabled
  • Document:
    • how to choose a model
    • where to put credentials
    • how to cap chunks/cost
    • how to resume
    • how to review generated artifacts
    • how to move from a generic profile to a specialized profile
  • Update README and replacement docs with the new generator path
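
An opt-in live smoke path can be gated so it never runs in the default suite. The sketch below uses the standard library's unittest; the `INFOSPACE_BENCH_LIVE_SMOKE` variable and the test class are hypothetical, and the test body is a placeholder rather than a real provider call.

```python
import os
import unittest

LIVE_FLAG = "INFOSPACE_BENCH_LIVE_SMOKE"  # hypothetical opt-in variable


def live_smoke_enabled(env=None) -> bool:
    """Live tests run only with an explicit opt-in flag and a configured key."""
    env = os.environ if env is None else env
    return env.get(LIVE_FLAG) == "1" and bool(env.get("OPENROUTER_API_KEY"))


class OpenRouterSmokeTest(unittest.TestCase):
    @unittest.skipUnless(
        live_smoke_enabled(),
        f"set {LIVE_FLAG}=1 and OPENROUTER_API_KEY to run the live smoke test",
    )
    def test_single_chunk_generation(self):
        # Placeholder: a real smoke test would run one capped chunk
        # through the generator with --provider openrouter.
        self.assertTrue(live_smoke_enabled())
```

Requiring both the flag and the key means that a developer who happens to have OPENROUTER_API_KEY exported does not trigger paid calls by accident.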

Acceptance

  • A user can generate a new infospace from a local article fixture using only deterministic fixture responses
  • A user can generate a new infospace from an ebook-like fixture using only deterministic fixture responses
  • A user can generate a new infospace from a folder collection using only deterministic fixture responses
  • A user can run the same CLI with --provider openrouter --model <model-id> when OPENROUTER_API_KEY is configured
  • Generated sources, chunks, entities, relations, evaluations, metrics, history, and reports are manifest-backed and inspectable
  • Generation is resumable and idempotent for unchanged inputs
  • Stale outputs are detected when source or profile/template inputs change
  • Live provider calls are explicit, auditable, and absent from default tests

Relationship To Existing Work

  • Builds on IB-WP-0013, which proved the explicit workflow shape for the Wealth/VSM one-chapter pilot.
  • Should stay compatible with IB-WP-0014, but should not wait for remote backend support.
  • Continues the successor split:
    • markitect-tool: markdown parsing, templates, contracts
    • infospace-bench: applied infospace generation workflow and CLI
    • kontextual-engine: durable runtime/retrieval/audit if needed later

Implementation Notes

Completed on 2026-05-14.

  • Added generic source intake for Markdown, plain text, local HTML, EPUB-like archives, and folder collections.
  • Added the general-knowledge profile with prompt templates and contracts.
  • Added an explicit OpenRouter assisted-generation adapter with mocked provider tests and environment-based credential lookup.
  • Added infospace-bench generate subcommands for init, plan, run, resume, status, and from-source flows.
  • Added generation state, resume skipping, source/profile stale detection, metrics/history recording, and a manifest-backed generation report.
  • Added deterministic acceptance tests for article, ebook-like, and folder generation using fixture responses.