generic source-to-infospace generator

2026-05-14 19:33:22 +02:00
parent 065e17f42e
commit 46aad3cce8
20 changed files with 1629 additions and 8 deletions
--- a/README.md
+++ b/README.md
@@ -31,6 +31,7 @@ Start with:
 - `docs/legacy-infospace-migration-guide.md`
 - `docs/replacement-readiness-decision.md`
 - `docs/wealth-vsm-generation-pipeline.md`
 - `docs/generic-source-generator.md`
 - `infospaces/bootstrap-pilot/`
 - `infospaces/wealth-vsm-legacy-slice/`
 - `infospaces/wealth-vsm-generation-pilot/`
--- a/docs/generic-source-generator.md
+++ b/docs/generic-source-generator.md
@@ -0,0 +1,94 @@
 # Generic Source Generator
 Date: 2026-05-14
 ## Purpose
 `infospace-bench generate` turns a local article, ebook-like file, or folder of
 knowledge sources into a manifest-backed infospace. It generalizes the
 Wealth/VSM pilot into an explicit workflow path with deterministic fixture
 support and an optional OpenRouter provider.
 ## Deterministic Run
 Use fixture responses for repeatable tests and demos:
 ```bash
 infospace-bench generate from-source ./examples/article.md \
  --workspace . \
  --slug article-space \
  --name "Article Space" \
  --profile general-knowledge \
  --fixture-responses ./examples/responses.yaml \
  --apply
 ```
 The command creates normalized source chunks, installs the selected profile,
 runs the declared workflows, and writes entities, relations, evaluations,
 metrics, history, and a generation report, registering artifacts in
 `artifacts/index.yaml`. Without `--apply`, it initializes the infospace and
 prints the plan without running any workflow.
 ## Stepwise Workflow
 ```bash
 infospace-bench generate init ./book.epub \
  --workspace . \
  --slug book-space \
  --name "Book Space" \
  --profile general-knowledge \
  --max-chunks 3
 infospace-bench generate plan ./infospaces/book-space --stage all
 infospace-bench generate run ./infospaces/book-space \
  --fixture-responses ./responses.yaml
 infospace-bench generate status ./infospaces/book-space
 ```
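 `--max-chunks` caps the number of source chunks taken at intake, which keeps
 early experiments small and bounds provider cost; the default `0` applies no
 cap. `generate status` reports chunk counts, generated artifact counts,
 evaluations, metrics, history, and stale source/profile inputs.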
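As a rough illustration of consuming that status output, the sketch below condenses a status payload into one line. The key names follow the JSON printed by `generate status`; the helper `summarize_status`, the sample values, and the chunk ID are hypothetical.

```python
import json

# Hypothetical sample of the JSON printed by `generate status`.
# Key names follow the status payload; the values are invented.
SAMPLE_STATUS = """
{
  "slug": "book-space",
  "source_chunk_count": 3,
  "entity_count": 12,
  "relation_count": 3,
  "evaluation_count": 12,
  "stale": true,
  "stale_sources": ["source/chunk-002.md"],
  "stale_profile": false,
  "completed": true
}
"""


def summarize_status(payload: dict) -> str:
    """Return a one-line, human-readable summary of a status payload."""
    counts = ", ".join(
        f"{payload.get(key, 0)} {label}"
        for key, label in (
            ("source_chunk_count", "chunks"),
            ("entity_count", "entities"),
            ("relation_count", "relations"),
            ("evaluation_count", "evaluations"),
        )
    )
    if payload.get("stale"):
        # Collect what made the run stale: changed sources and/or the profile.
        reasons = list(payload.get("stale_sources") or [])
        if payload.get("stale_profile"):
            reasons.append("profile")
        state = "stale (" + ", ".join(reasons) + ")"
    elif payload.get("completed"):
        state = "complete"
    else:
        state = "incomplete"
    return f"{payload.get('slug', '?')}: {counts}; {state}"


summary = summarize_status(json.loads(SAMPLE_STATUS))
print(summary)
```

In practice the payload would come from the command's standard output rather than an inline string.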
 ## OpenRouter
 Live model calls are explicit:
 ```bash
 export OPENROUTER_API_KEY=...
 infospace-bench generate run ./infospaces/book-space \
  --provider openrouter \
  --model openai/gpt-4o-mini \
  --stage all
 ```
 `--model` is required with the OpenRouter provider and must be an OpenRouter
 model ID. The API key is read from `OPENROUTER_API_KEY`; it is not written to
 `infospace.yaml`. Default tests never make live provider calls.
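A pre-flight check along these lines can catch a missing key before any run starts. This is a minimal sketch, not the adapter's code: `resolve_api_key` and its `env` parameter are hypothetical names, and `sk-demo` is a placeholder value.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(explicit: str = "", env: Optional[Mapping[str, str]] = None) -> str:
    """Prefer an explicitly passed key, else fall back to OPENROUTER_API_KEY."""
    environment = os.environ if env is None else env
    key = explicit or environment.get("OPENROUTER_API_KEY", "")
    if not key:
        # Fail early, before any provider request is attempted.
        raise RuntimeError("OPENROUTER_API_KEY is required for the OpenRouter provider")
    return key


# Passing a mapping instead of the real environment keeps the check testable.
demo_key = resolve_api_key(env={"OPENROUTER_API_KEY": "sk-demo"})
```

Because the key is resolved at call time and never persisted, nothing secret needs to appear in the infospace configuration.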
 ## Resume
 Use resume for interrupted or reviewed runs:
 ```bash
 infospace-bench generate resume ./infospaces/book-space \
  --provider openrouter \
  --model openai/gpt-4o-mini
 ```
 A run that already completed and whose inputs are unchanged is skipped. Use
 `--force` when you intentionally want to rerun completed work. A run is
 reported as stale, and is not skipped, when source artifact digests or
 installed profile/template files change.
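The skip and staleness rules can be sketched as follows, assuming SHA-256 digests over UTF-8 chunk text. The helper names `digest_text`, `is_stale`, and `should_skip` are illustrative, and the temporary file stands in for a normalized source chunk.

```python
import hashlib
import tempfile
from pathlib import Path


def digest_text(text: str) -> str:
    """SHA-256 hex digest of UTF-8 text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def is_stale(path: Path, recorded_digest: str) -> bool:
    """A chunk is stale if its file is missing or its content digest changed."""
    if not path.is_file():
        return True
    current = digest_text(path.read_text(encoding="utf-8"))
    return bool(recorded_digest) and current != recorded_digest


def should_skip(*, resume: bool, force: bool, completed: bool, stale: bool) -> bool:
    """Skip only a resumed, completed, non-stale run that was not forced."""
    return resume and not force and completed and not stale


with tempfile.TemporaryDirectory() as tmp:
    chunk = Path(tmp) / "chunk.md"
    chunk.write_text("# Chapter 1\n", encoding="utf-8")
    recorded = digest_text("# Chapter 1\n")
    unchanged = is_stale(chunk, recorded)
    chunk.write_text("# Chapter 1 (edited)\n", encoding="utf-8")
    edited = is_stale(chunk, recorded)

print(unchanged, edited)  # False True
```

Comparing digests rather than timestamps means that touching a file without changing its content does not trigger a rerun.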
 ## Review Path
 After generation:
 - inspect `artifacts/sources/` for normalized input chunks
 - inspect `artifacts/entities/` and `artifacts/relations/` for generated claims
 - inspect `output/evaluations/` for rubric output
 - run `infospace-bench validate <root>` and `infospace-bench graph <root>`
 - review `reports/generation-summary.md`
 Move from the generic profile to a specialized profile when the source domain
 needs stricter terminology, narrower extraction granularity, or a discipline
 lens such as VSM.
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -11,6 +11,9 @@ dependencies = [
 [project.scripts]
 infospace-bench = "infospace_bench.cli:main"
 [tool.setuptools.package-data]
 infospace_bench = ["profiles/**/*"]
 [tool.pytest.ini_options]
 pythonpath = ["src", "../markitect-tool/src"]
 testpaths = ["tests"]
--- a/src/infospace_bench/cli.py
+++ b/src/infospace_bench/cli.py
@@ -10,6 +10,12 @@ from .checks import run_collection_checks
 from .engine import engine_capability_contract, plan_asset_sync, sync_assets
 from .errors import InfospaceError
 from .evaluation_io import read_entity_evaluations
 from .generator import (
    init_generation_infospace,
    plan_generation,
    run_generation,
    status_generation,
 )
 from .history import (
    build_viability_report,
    find_snapshot,
@@ -123,6 +129,72 @@ def build_parser() -> argparse.ArgumentParser:
        help="Run assisted stages with deterministic fixture responses",
    )
    generate = sub.add_parser("generate", help="Generate infospaces from sources")
    generate_sub = generate.add_subparsers(dest="generate_command", required=True)
    generate_init = generate_sub.add_parser(
        "init",
        help="Create a generation infospace from a local source",
    )
    generate_init.add_argument("source")
    generate_init.add_argument("--workspace", default=".")
    generate_init.add_argument("--slug", required=True)
    generate_init.add_argument("--name", required=True)
    generate_init.add_argument("--profile", default="general-knowledge")
    generate_init.add_argument("--max-chunks", type=int, default=0)
    generate_plan = generate_sub.add_parser(
        "plan",
        help="Plan generator work without provider calls",
    )
    generate_plan.add_argument("root")
    generate_plan.add_argument("--stage", default="all")
    generate_run = generate_sub.add_parser(
        "run",
        help="Run generator workflows for an infospace",
    )
    generate_run.add_argument("root")
    generate_run.add_argument("--stage", default="all")
    generate_run.add_argument("--provider", choices=["fixture", "openrouter"], default="fixture")
    generate_run.add_argument("--model", default="")
    generate_run.add_argument("--fixture-responses", default="")
    generate_run.add_argument("--resume", action="store_true")
    generate_run.add_argument("--force", action="store_true")
    generate_resume = generate_sub.add_parser(
        "resume",
        help="Resume generator workflows for an infospace",
    )
    generate_resume.add_argument("root")
    generate_resume.add_argument("--stage", default="all")
    generate_resume.add_argument("--provider", choices=["fixture", "openrouter"], default="fixture")
    generate_resume.add_argument("--model", default="")
    generate_resume.add_argument("--fixture-responses", default="")
    generate_resume.add_argument("--force", action="store_true")
    generate_status = generate_sub.add_parser(
        "status",
        help="Inspect generator status for an infospace",
    )
    generate_status.add_argument("root")
    generate_from_source = generate_sub.add_parser(
        "from-source",
        help="Initialize and optionally run generation from a local source",
    )
    generate_from_source.add_argument("source")
    generate_from_source.add_argument("--workspace", default=".")
    generate_from_source.add_argument("--slug", required=True)
    generate_from_source.add_argument("--name", required=True)
    generate_from_source.add_argument("--profile", default="general-knowledge")
    generate_from_source.add_argument("--stage", default="all")
    generate_from_source.add_argument("--provider", choices=["fixture", "openrouter"], default="fixture")
    generate_from_source.add_argument("--model", default="")
    generate_from_source.add_argument("--fixture-responses", default="")
    generate_from_source.add_argument("--max-chunks", type=int, default=0)
    generate_from_source.add_argument("--apply", action="store_true")
    engine = sub.add_parser("engine", help="Inspect and sync engine boundary state")
    engine_sub = engine.add_subparsers(dest="engine_command", required=True)
@@ -284,6 +356,73 @@ def main(argv: list[str] | None = None) -> int:
                )
            else:
                parser.error(f"Unhandled workflow command: {args.workflow_command}")
        elif args.command == "generate":
            if args.generate_command == "init":
                infospace = init_generation_infospace(
                    Path(args.workspace),
                    Path(args.source),
                    args.slug,
                    name=args.name,
                    profile=args.profile,
                    max_chunks=_optional_positive(args.max_chunks),
                )
                _write_json(
                    {
                        "slug": infospace.config.slug,
                        "root": str(infospace.root),
                        "status": "initialized",
                    }
                )
            elif args.generate_command == "plan":
                _write_json(plan_generation(Path(args.root), stage=args.stage))
            elif args.generate_command == "run":
                _write_json(
                    run_generation(
                        Path(args.root),
                        stage=args.stage,
                        provider=args.provider,
                        model=args.model,
                        fixture_responses=args.fixture_responses or None,
                        resume=args.resume,
                        force=args.force,
                    ).to_dict()
                )
            elif args.generate_command == "resume":
                _write_json(
                    run_generation(
                        Path(args.root),
                        stage=args.stage,
                        provider=args.provider,
                        model=args.model,
                        fixture_responses=args.fixture_responses or None,
                        resume=True,
                        force=args.force,
                    ).to_dict()
                )
            elif args.generate_command == "status":
                _write_json(status_generation(Path(args.root)))
            elif args.generate_command == "from-source":
                infospace = init_generation_infospace(
                    Path(args.workspace),
                    Path(args.source),
                    args.slug,
                    name=args.name,
                    profile=args.profile,
                    max_chunks=_optional_positive(args.max_chunks),
                )
                if args.apply:
                    result = run_generation(
                        infospace.root,
                        stage=args.stage,
                        provider=args.provider,
                        model=args.model,
                        fixture_responses=args.fixture_responses or None,
                    )
                    _write_json(result.to_dict())
                else:
                    _write_json(plan_generation(infospace.root, stage=args.stage))
            else:
                parser.error(f"Unhandled generate command: {args.generate_command}")
        elif args.command == "engine":
            if args.engine_command == "inspect":
                _write_json(
@@ -377,3 +516,7 @@ def _relationship_summary_payload(summary) -> dict:
 def _write_json(payload: dict) -> None:
    print(json.dumps(payload, indent=2))
 def _optional_positive(value: int) -> int | None:
    return value if value > 0 else None
--- a/src/infospace_bench/generator.py
+++ b/src/infospace_bench/generator.py
@@ -0,0 +1,525 @@
 from __future__ import annotations
 import hashlib
 import shutil
 from dataclasses import asdict, dataclass, field
 from datetime import datetime, timezone
 from pathlib import Path
 from typing import Any
 import yaml
 from .checks import run_collection_checks
 from .errors import InfospaceError
 from .evaluation_io import read_entity_evaluations
 from .history import get_history, read_metrics_file, record_check_results
 from .lifecycle import create_infospace, load_infospace, register_artifact
 from .openrouter import OpenRouterAssistedGenerationAdapter
 from .source_intake import SourceChunk, normalize_source
 from .workflow import (
    AssistedGenerationAdapter,
    FixtureAssistedGenerationAdapter,
    WorkflowRunResult,
    plan_workflow,
    run_workflow,
 )
 STATE_PATH = Path("output/workflows/generation-state.yaml")
 DEFAULT_PROFILE = "general-knowledge"
 WORKFLOW_BY_STAGE = {
    "summary": ["generic-source-summary"],
    "summarize": ["generic-source-summary"],
    "extract": ["generic-source-entities"],
    "entities": ["generic-source-entities"],
    "relations": ["generic-source-relations"],
    "evaluate": ["generic-source-evaluations"],
    "evaluation": ["generic-source-evaluations"],
    "all": [
        "generic-source-summary",
        "generic-source-entities",
        "generic-source-relations",
        "generic-source-evaluations",
    ],
 }
 @dataclass(frozen=True)
 class GenerationRunResult:
    root: str
    status: str
    stage: str
    skipped: bool = False
    stale: bool = False
    workflows: list[dict[str, Any]] = field(default_factory=list)
    metrics: dict[str, Any] = field(default_factory=dict)
    history_snapshot_id: str = ""
    def to_dict(self) -> dict[str, Any]:
        data = asdict(self)
        return {key: value for key, value in data.items() if value not in ("", [], {})}
 def init_generation_infospace(
    workspace: str | Path,
    source: str | Path,
    slug: str,
    *,
    name: str,
    profile: str = DEFAULT_PROFILE,
    max_chunks: int | None = None,
 ) -> Any:
    chunks = normalize_source(source, max_chunks=max_chunks)
    infospace = create_infospace(Path(workspace), slug, name=name)
    _install_profile(infospace.root, profile)
    _write_workflows(infospace.root, profile)
    _register_source_chunks(infospace.root, chunks)
    _write_state(
        infospace.root,
        {
            "profile": profile,
            "source": str(Path(source)),
            "source_chunks": _source_state(infospace.root),
            "profile_digest": _profile_digest(infospace.root, profile),
            "stage_status": {},
            "completed": False,
            "created_at": _now(),
            "updated_at": _now(),
        },
    )
    return load_infospace(infospace.root)
 def plan_generation(root: str | Path, *, stage: str = "all") -> dict[str, Any]:
    root_path = Path(root)
    workflow_ids = _workflow_ids_for_stage(stage)
    plans: list[dict[str, Any]] = []
    for workflow_id in workflow_ids:
        try:
            plans.append(plan_workflow(root_path, workflow_id).to_dict())
        except InfospaceError as exc:
            plans.append(
                {
                    "workflow_id": workflow_id,
                    "status": "blocked",
                    "error": exc.to_dict(),
                }
            )
    status = status_generation(root_path)
    return {
        "root": str(root_path),
        "stage": stage,
        "status": "planned",
        "stale": status["stale"],
        "source_chunk_count": status["source_chunk_count"],
        "workflows": plans,
    }
 def run_generation(
    root: str | Path,
    *,
    stage: str = "all",
    provider: str = "fixture",
    model: str = "",
    fixture_responses: str | Path | None = None,
    resume: bool = False,
    force: bool = False,
 ) -> GenerationRunResult:
    root_path = Path(root)
    stage_key = stage.strip().lower()
    state = _read_state(root_path)
    status = status_generation(root_path)
    workflow_ids = _workflow_ids_for_stage(stage_key)
    if resume and not force and state.get("completed") is True and not status["stale"]:
        return GenerationRunResult(
            root=str(root_path),
            status="skipped",
            stage=stage,
            skipped=True,
            stale=False,
            workflows=[],
            metrics=status.get("metrics", {}),
        )
    adapter = (
        _adapter_for(provider, model=model, fixture_responses=fixture_responses)
        if workflow_ids
        else None
    )
    workflow_results: list[dict[str, Any]] = []
    for workflow_id in workflow_ids:
        result = run_workflow(root_path, workflow_id, assisted_adapter=adapter)
        workflow_results.append(result.to_dict())
        state = _mark_workflow_completed(state, result)
    metrics: dict[str, Any] = {}
    snapshot_id = ""
    if stage_key in {"all", "metrics"}:
        check_result = _record_metrics(root_path)
        metrics = check_result.metrics
        snapshot_id = check_result.snapshot.snapshot_id
        _write_generation_report(root_path, metrics, snapshot_id)
    state.update(
        {
            "source_chunks": _source_state(root_path),
            "profile_digest": _profile_digest(root_path, str(state.get("profile") or DEFAULT_PROFILE)),
            "completed": stage_key in {"all", "metrics"},
            "updated_at": _now(),
            "last_run": {
                "stage": stage,
                "provider": provider,
                "model": model,
                "workflow_count": len(workflow_results),
                "snapshot_id": snapshot_id,
                "completed_at": _now(),
            },
        }
    )
    _write_state(root_path, state)
    return GenerationRunResult(
        root=str(root_path),
        status="completed",
        stage=stage,
        skipped=False,
        stale=False,
        workflows=workflow_results,
        metrics=metrics,
        history_snapshot_id=snapshot_id,
    )
 def status_generation(root: str | Path) -> dict[str, Any]:
    root_path = Path(root)
    infospace = load_infospace(root_path)
    state = _read_state(root_path)
    stale_sources = _stale_source_ids(infospace.root)
    profile = str(state.get("profile") or DEFAULT_PROFILE)
    stale_profile = bool(
        state.get("profile_digest")
        and state.get("profile_digest") != _profile_digest(infospace.root, profile)
    )
    evaluations = read_entity_evaluations(infospace.root / "output" / "evaluations")
    history = get_history(infospace.root)
    return {
        "root": str(infospace.root),
        "slug": infospace.config.slug,
        "profile": profile,
        "source_chunk_count": sum(1 for item in infospace.artifacts if item.kind == "source"),
        "entity_count": sum(1 for item in infospace.artifacts if item.kind == "entity"),
        "relation_count": sum(1 for item in infospace.artifacts if item.kind == "relation"),
        "evaluation_count": len(evaluations),
        "generated_count": sum(1 for item in infospace.artifacts if item.kind == "generated"),
        "metrics": read_metrics_file(infospace.root / "output" / "metrics" / "metrics.yaml"),
        "history_snapshot_count": len(history),
        "latest_snapshot_id": history[-1].snapshot_id if history else "",
        "stale": bool(stale_sources or stale_profile),
        "stale_sources": stale_sources,
        "stale_profile": stale_profile,
        "completed": bool(state.get("completed", False)),
        "stage_status": state.get("stage_status", {}),
    }
 def _adapter_for(
    provider: str,
    *,
    model: str,
    fixture_responses: str | Path | None,
 ) -> AssistedGenerationAdapter:
    if fixture_responses:
        return FixtureAssistedGenerationAdapter.from_file(Path(fixture_responses))
    if provider == "openrouter":
        return OpenRouterAssistedGenerationAdapter(model=model)
    raise InfospaceError(
        "missing_assisted_generation_adapter",
        "Assisted generation requires --fixture-responses or --provider openrouter",
        {"provider": provider},
    )
 def _register_source_chunks(root: Path, chunks: list[SourceChunk]) -> None:
    for chunk in chunks:
        path = root / "artifacts" / "sources" / f"{chunk.chunk_id}.md"
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(chunk.markdown, encoding="utf-8")
        register_artifact(
            root,
            artifact_id=f"source/{chunk.chunk_id}.md",
            path=path,
            kind="source",
            title=chunk.title,
            provenance={
                "original_path": chunk.original_path,
                "source_type": chunk.source_type,
                "digest": chunk.digest,
                "chunk_id": chunk.chunk_id,
                "chunk_index": chunk.chunk_index,
                "chunk_count": chunk.chunk_count,
                "imported_at": chunk.imported_at,
                "extractor_version": chunk.extractor_version,
            },
        )
 def _install_profile(root: Path, profile: str) -> None:
    source = Path(__file__).parent / "profiles" / profile
    if not source.is_dir():
        raise InfospaceError(
            "missing_generation_profile",
            f"Generation profile does not exist: {profile}",
            {"profile": profile, "path": str(source)},
        )
    profile_target = root / "profiles" / profile
    template_target = root / "workflows" / "templates" / profile
    shutil.copytree(source, profile_target, dirs_exist_ok=True)
    shutil.copytree(source / "templates", template_target, dirs_exist_ok=True)
 def _write_workflows(root: Path, profile: str) -> None:
    config_path = root / "infospace.yaml"
    config = yaml.safe_load(config_path.read_text(encoding="utf-8")) or {}
    config["schemas"] = {
        **dict(config.get("schemas") or {}),
        "entity": f"profiles/{profile}/contracts/entity.contract.md",
        "relation": f"profiles/{profile}/contracts/relation.contract.md",
        "evaluation": f"profiles/{profile}/contracts/evaluation.contract.md",
    }
    config["workflows"] = _profile_workflows(profile)
    config_path.write_text(yaml.safe_dump(config, sort_keys=False), encoding="utf-8")
 def _profile_workflows(profile: str) -> list[dict[str, Any]]:
    base = f"workflows/templates/{profile}"
    return [
        {
            "id": "generic-source-summary",
            "description": "Summarize normalized source chunks.",
            "inputs": {"source": {"kind": "source"}},
            "static_macros": {"profile": profile},
            "stages": [
                {
                    "id": "summarize-source",
                    "kind": "assisted",
                    "input": "source",
                    "template": f"{base}/summarize-source.md",
                    "provider_hint": "openrouter",
                    "output": {
                        "path": "artifacts/generated/{{ input.slug }}-summary.md",
                        "artifact_id": "generated/{{ input.slug }}-summary.md",
                        "kind": "generated",
                        "title": "{{ input.title }} Summary",
                    },
                }
            ],
        },
        {
            "id": "generic-source-entities",
            "description": "Extract reusable entity artifacts from source chunks.",
            "inputs": {"source": {"kind": "source"}},
            "static_macros": {"profile": profile},
            "stages": [
                {
                    "id": "extract-entities",
                    "kind": "assisted",
                    "input": "source",
                    "template": f"{base}/extract-entities.md",
                    "provider_hint": "openrouter",
                    "output": {
                        "path": "artifacts/generated/{{ input.slug }}-entities.md",
                        "artifact_id": "generated/{{ input.slug }}-entities.md",
                        "kind": "generated",
                        "title": "{{ input.title }} Entity Bundle",
                    },
                },
                {
                    "id": "split-entities",
                    "kind": "split_entities",
                    "input": "source",
                    "template": "",
                    "static_macros": {"bundle_stage": "extract-entities"},
                },
            ],
        },
        {
            "id": "generic-source-relations",
            "description": "Extract relation artifacts from source chunks.",
            "inputs": {"source": {"kind": "source"}},
            "static_macros": {"profile": profile},
            "stages": [
                {
                    "id": "extract-relations",
                    "kind": "assisted",
                    "input": "source",
                    "template": f"{base}/extract-relations.md",
                    "provider_hint": "openrouter",
                    "output": {
                        "path": "artifacts/relations/{{ input.slug }}-relations.md",
                        "artifact_id": "relation/{{ input.slug }}-relations.md",
                        "kind": "relation",
                        "title": "{{ input.title }} Relations",
                    },
                }
            ],
        },
        {
            "id": "generic-source-evaluations",
            "description": "Evaluate generated entities with the profile rubric.",
            "inputs": {"entity": {"kind": "entity"}},
            "static_macros": {"profile": profile},
            "stages": [
                {
                    "id": "evaluate-entity",
                    "kind": "assisted",
                    "input": "entity",
                    "template": f"{base}/evaluate-entity.md",
                    "provider_hint": "openrouter",
                    "output": {
                        "path": "output/evaluations/{{ input.slug }}.md",
                        "artifact_id": "generated/evaluation-{{ input.slug }}.md",
                        "kind": "generated",
                        "title": "{{ input.title }} Evaluation",
                    },
                }
            ],
        },
    ]
 def _record_metrics(root: Path) -> Any:
    infospace = load_infospace(root)
    return record_check_results(
        infospace.root,
        run_collection_checks(infospace.artifacts),
        artifact_evaluations=read_entity_evaluations(infospace.root / "output" / "evaluations"),
        schema_name="generic-source",
        metadata={"generator": "generic-source"},
    )
 def _write_generation_report(root: Path, metrics: dict[str, Any], snapshot_id: str) -> None:
    status = status_generation(root)
    text = "\n".join(
        [
            "# Generation Report",
            "",
            f"Snapshot: {snapshot_id}",
            f"Sources: {status['source_chunk_count']}",
            f"Entities: {status['entity_count']}",
            f"Relations: {status['relation_count']}",
            f"Evaluations: {status['evaluation_count']}",
            "",
            "## Metrics",
            "",
            *[f"- {name}: {value}" for name, value in sorted(metrics.items())],
            "",
        ]
    )
    path = root / "reports" / "generation-summary.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    register_artifact(
        root,
        artifact_id="generated/generation-summary.md",
        path=path,
        kind="generated",
        title="Generation Summary",
        provenance={"workflow_id": "generic-source-generator", "snapshot_id": snapshot_id},
    )
 def _workflow_ids_for_stage(stage: str) -> list[str]:
    normalized = stage.strip().lower()
    if normalized == "intake":
        return []
    if normalized == "metrics":
        return []
    if normalized not in WORKFLOW_BY_STAGE:
        raise InfospaceError(
            "invalid_generation_stage",
            f"Unsupported generation stage: {stage}",
            {
                "stage": stage,
                "valid_stages": sorted([*WORKFLOW_BY_STAGE, "intake", "metrics"]),
            },
        )
    return WORKFLOW_BY_STAGE[normalized]
 def _source_state(root: Path) -> dict[str, Any]:
    infospace = load_infospace(root)
    return {
        item.id: {
            "path": item.path,
            "digest": item.provenance.get("digest", ""),
            "title": item.title,
            "source_type": item.provenance.get("source_type", ""),
            "chunk_id": item.provenance.get("chunk_id", ""),
        }
        for item in infospace.artifacts
        if item.kind == "source"
    }
 def _stale_source_ids(root: Path) -> list[str]:
    infospace = load_infospace(root)
    stale: list[str] = []
    for item in infospace.artifacts:
        if item.kind != "source":
            continue
        path = infospace.root / item.path
        expected = str(item.provenance.get("digest") or "")
        if not path.is_file() or (expected and _digest_text(path.read_text(encoding="utf-8")) != expected):
            stale.append(item.id)
    return stale
 def _mark_workflow_completed(
    state: dict[str, Any],
    result: WorkflowRunResult,
 ) -> dict[str, Any]:
    stage_status = dict(state.get("stage_status") or {})
    stage_status[result.workflow_id] = {
        "status": result.status,
        "run_id": result.run_id,
        "output_artifact_ids": [output.artifact_id for output in result.outputs],
        "updated_at": _now(),
    }
    return {**state, "stage_status": stage_status}
 def _profile_digest(root: Path, profile: str) -> str:
    files: list[Path] = []
    for base in (
        root / "profiles" / profile,
        root / "workflows" / "templates" / profile,
    ):
        if base.is_dir():
            files.extend(path for path in sorted(base.rglob("*")) if path.is_file())
    hasher = hashlib.sha256()
    for path in files:
        hasher.update(str(path.relative_to(root)).encode("utf-8"))
        hasher.update(path.read_bytes())
    return hasher.hexdigest()
 def _read_state(root: Path) -> dict[str, Any]:
    path = root / STATE_PATH
    if not path.is_file():
        return {}
    data = yaml.safe_load(path.read_text(encoding="utf-8"))
    return data if isinstance(data, dict) else {}
 def _write_state(root: Path, state: dict[str, Any]) -> None:
    path = root / STATE_PATH
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(yaml.safe_dump(state, sort_keys=False), encoding="utf-8")
 def _digest_text(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
 def _now() -> str:
    return datetime.now(timezone.utc).isoformat()
--- a/src/infospace_bench/openrouter.py
+++ b/src/infospace_bench/openrouter.py
@@ -0,0 +1,142 @@
 from __future__ import annotations
 import json
 import os
 import time
 import urllib.error
 import urllib.request
 from dataclasses import dataclass
 from typing import Any, Callable
 from .errors import InfospaceError
 from .workflow import AssistedGenerationRequest, AssistedGenerationResult
 OPENROUTER_ENDPOINT = "https://openrouter.ai/api/v1/chat/completions"
 Transport = Callable[[dict[str, Any], dict[str, str], str], dict[str, Any]]
 @dataclass(frozen=True)
 class OpenRouterAssistedGenerationAdapter:
    model: str
    api_key: str = ""
    endpoint: str = OPENROUTER_ENDPOINT
    transport: Transport | None = None
    retry_limit: int = 2
    timeout_seconds: float = 60.0
    def __post_init__(self) -> None:
        key = self.api_key or os.environ.get("OPENROUTER_API_KEY", "")
        if not key:
            raise InfospaceError(
                "missing_openrouter_api_key",
                "OPENROUTER_API_KEY is required for the OpenRouter provider",
                {"env": "OPENROUTER_API_KEY"},
            )
        object.__setattr__(self, "api_key", key)
        if not self.model:
            raise InfospaceError(
                "missing_openrouter_model",
                "OpenRouter provider requires an explicit model",
                {"option": "--model"},
            )
    def generate(
        self,
        request: AssistedGenerationRequest,
    ) -> AssistedGenerationResult:
        payload = {
            "model": self.model,
            "messages": [
                {
                    "role": "system",
                    "content": (
                        "Return concise, valid Markdown only. Preserve explicit "
                        "contracts requested in the user prompt."
                    ),
                },
                {"role": "user", "content": request.prompt},
            ],
            "metadata": {
                "workflow_id": request.workflow_id,
                "stage_id": request.stage_id,
                "input_artifact_id": request.input_artifact_id,
            },
        }
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json",
            "HTTP-Referer": "https://github.com/markitect/infospace-bench",
            "X-Title": "infospace-bench",
        }
        started = time.monotonic()
        retry_count = 0
        last_error = ""
        while True:
            try:
                response = (
                    self.transport(payload, headers, self.endpoint)
                    if self.transport is not None
                    else self._default_transport(payload, headers, self.endpoint)
                )
                choice = (response.get("choices") or [{}])[0]
                message = choice.get("message") or {}
                markdown = str(message.get("content") or "")
                if not markdown:
                    raise InfospaceError(
                        "empty_openrouter_response",
                        "OpenRouter returned an empty assistant response",
                        {"model": self.model, "response_id": response.get("id")},
                    )
                return AssistedGenerationResult(
                    markdown=markdown,
                    provider="openrouter",
                    metadata={
                        "model": self.model,
                        "request_id": str(response.get("id") or ""),
                        "usage": response.get("usage") or {},
                        "retry_count": retry_count,
                        "duration_seconds": round(time.monotonic() - started, 3),
                    },
                )
            except (urllib.error.HTTPError, urllib.error.URLError, TimeoutError) as exc:
                last_error = str(exc)
            except InfospaceError:
                raise
            except Exception as exc:  # pragma: no cover - defensive provider boundary
                last_error = str(exc)
            if retry_count >= self.retry_limit:
                raise InfospaceError(
                    "openrouter_request_failed",
                    "OpenRouter request failed after bounded retries",
                    {
                        "model": self.model,
                        "retry_count": retry_count,
                        "error": last_error,
                    },
                )
            retry_count += 1
            # Exponential backoff capped at 8 seconds (2s, 4s, 8s, 8s, ...).
            time.sleep(min(2**retry_count, 8))
    def _default_transport(
        self,
        payload: dict[str, Any],
        headers: dict[str, str],
        endpoint: str,
    ) -> dict[str, Any]:
        request = urllib.request.Request(
            endpoint,
            data=json.dumps(payload).encode("utf-8"),
            headers=headers,
            method="POST",
        )
        with urllib.request.urlopen(request, timeout=self.timeout_seconds) as response:
            data = response.read().decode("utf-8")
        parsed = json.loads(data)
        if not isinstance(parsed, dict):
            raise InfospaceError(
                "invalid_openrouter_response",
                "OpenRouter returned a non-object JSON response",
                {"model": self.model},
            )
        return parsed
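The retry loop in `generate` above sleeps `min(2**retry_count, 8)` seconds between attempts and gives up once `retry_count` reaches `retry_limit`. The following is a minimal standalone sketch of that schedule, not the adapter itself. `backoff_delays` and `call_with_retries` are hypothetical helper names, and the sleep function is injected so the example runs without waiting.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(retry_limit: int, cap: int = 8) -> list[int]:
    """Seconds slept before retry 1..retry_limit, mirroring min(2**n, cap)."""
    return [min(2**attempt, cap) for attempt in range(1, retry_limit + 1)]


def call_with_retries(
    operation: Callable[[], T],
    retry_limit: int,
    sleep: Callable[[float], None],
    cap: int = 8,
) -> T:
    """Run operation, retrying up to retry_limit times with capped backoff."""
    retry_count = 0
    while True:
        try:
            return operation()
        except Exception as exc:
            if retry_count >= retry_limit:
                raise RuntimeError(f"failed after {retry_count} retries: {exc}") from exc
            retry_count += 1
            sleep(min(2**retry_count, cap))


# Usage: a simulated operation that succeeds on its third attempt.
attempts = {"count": 0}
slept: list[float] = []


def flaky() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


result = call_with_retries(flaky, retry_limit=2, sleep=slept.append)
```

With the adapter's default `retry_limit` of 2, this schedule means at most three requests, separated by sleeps of 2 and 4 seconds.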
--- a/src/infospace_bench/profiles/general-knowledge/contracts/entity.contract.md
+++ b/src/infospace_bench/profiles/general-knowledge/contracts/entity.contract.md
@@ -0,0 +1,9 @@
 # Entity Contract
 Each generated entity must be a Markdown artifact with:
 - one top-level heading containing the entity title
 - a `## Definition` section
 - optional `## Context`, `## Source Evidence`, and `## Review Notes` sections
 Entity titles should be stable, short, and reusable across source chunks.
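The entity contract above can be checked mechanically. The sketch below is one possible validator, assuming headings are ATX-style (`#`, `##`) and that the artifact contains no fenced code with `#`-prefixed lines. `entity_contract_problems` is a hypothetical name and is not part of `infospace_bench`.

```python
import re

# A top-level heading is a single '#' followed by spaces and some text.
H1_RE = re.compile(r"(?m)^#[ \t]+\S.*$")
DEFINITION_RE = re.compile(r"(?m)^##[ \t]+Definition[ \t]*$")


def entity_contract_problems(markdown: str) -> list[str]:
    """Return human-readable contract violations; an empty list means OK."""
    problems: list[str] = []
    h1_count = len(H1_RE.findall(markdown))
    if h1_count != 1:
        problems.append(f"expected exactly one top-level heading, found {h1_count}")
    if not DEFINITION_RE.search(markdown):
        problems.append("missing '## Definition' section")
    return problems


valid_entity = "# Knowledge Artifact\n\n## Definition\n\nA durable unit of knowledge.\n"
headless_entity = "## Definition\n\nNo title here.\n"
```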
--- a/src/infospace_bench/profiles/general-knowledge/contracts/evaluation.contract.md
+++ b/src/infospace_bench/profiles/general-knowledge/contracts/evaluation.contract.md
@@ -0,0 +1,10 @@
 # Evaluation Contract
 Each evaluation must be Markdown with YAML frontmatter containing:
 - `artifact_id`
 - `evaluator`
 - `evaluated_at`
 - `scores`
 Scores should include groundedness and usefulness on a 0 to 5 scale.
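As an illustration of the evaluation contract above, the sketch below checks that the four required frontmatter keys are present. It is deliberately not a YAML parser: it only scans unindented `key:` lines between the `---` delimiters, which is an assumption about how the frontmatter is laid out. The helper names are hypothetical.

```python
import re

REQUIRED_KEYS = ("artifact_id", "evaluator", "evaluated_at", "scores")
# Matches only unindented keys, so nested entries under `scores:` are ignored.
TOP_LEVEL_KEY_RE = re.compile(r"([A-Za-z_][A-Za-z0-9_-]*):")


def frontmatter_keys(markdown: str) -> set[str]:
    """Collect top-level keys from a leading '---' delimited frontmatter block."""
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return set()
    keys: set[str] = set()
    for line in lines[1:]:
        if line.strip() == "---":
            return keys
        match = TOP_LEVEL_KEY_RE.match(line)
        if match:
            keys.add(match.group(1))
    return set()  # frontmatter was never closed


def missing_evaluation_keys(markdown: str) -> list[str]:
    keys = frontmatter_keys(markdown)
    return [key for key in REQUIRED_KEYS if key not in keys]


example_evaluation = (
    "---\n"
    "artifact_id: entity/knowledge-artifact.md\n"
    "evaluator: fixture\n"
    "evaluated_at: '2026-05-14T00:00:00'\n"
    "scores:\n"
    "  - name: groundedness\n"
    "    value: 4.0\n"
    "---\n"
    "\n"
    "# Evaluation\n"
)
```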
--- a/src/infospace_bench/profiles/general-knowledge/contracts/relation.contract.md
+++ b/src/infospace_bench/profiles/general-knowledge/contracts/relation.contract.md
@@ -0,0 +1,11 @@
 # Relation Contract
 Each generated relation must be a Markdown artifact with:
 - one top-level heading containing the relation title
 - `## Subject`
 - `## Predicate`
 - `## Object`
 - optional `## Relation Type`, `## Evidence`, and `## Feedback Role`
 Subject and object values should match generated entity titles whenever possible.
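To make the relation contract above concrete, the sketch below splits a relation artifact into its `##` sections and reports which required ones are missing or empty. `parse_sections` and `missing_relation_sections` are hypothetical names; the parser that `infospace_bench` actually uses is not shown in this diff.

```python
import re
from typing import Optional

SECTION_RE = re.compile(r"##[ \t]+(.+?)[ \t]*$")
REQUIRED_SECTIONS = ("Subject", "Predicate", "Object")


def parse_sections(markdown: str) -> dict[str, str]:
    """Map each '## Heading' to the stripped text that follows it."""
    sections: dict[str, list[str]] = {}
    current: Optional[str] = None
    for line in markdown.splitlines():
        match = SECTION_RE.match(line)
        if match:
            current = match.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(body).strip() for name, body in sections.items()}


def missing_relation_sections(markdown: str) -> list[str]:
    sections = parse_sections(markdown)
    return [name for name in REQUIRED_SECTIONS if not sections.get(name)]


example_relation = (
    "# Knowledge Artifact Supports Source Claim\n\n"
    "## Subject\n\nKnowledge Artifact\n\n"
    "## Predicate\n\nsupports\n\n"
    "## Object\n\nSource Claim\n"
)
parsed = parse_sections(example_relation)
```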
--- a/src/infospace_bench/profiles/general-knowledge/contracts/summary.contract.md
+++ b/src/infospace_bench/profiles/general-knowledge/contracts/summary.contract.md
@@ -0,0 +1,7 @@
 # Summary Contract
 Each source summary should preserve:
 - the core claims or concepts
 - evidence phrases useful for later review
 - unresolved ambiguities or extraction risks
--- a/src/infospace_bench/profiles/general-knowledge/profile.yaml
+++ b/src/infospace_bench/profiles/general-knowledge/profile.yaml
@@ -0,0 +1,14 @@
 id: general-knowledge
 name: General Knowledge
 description: Generic infospace generation profile for local articles, ebooks, and knowledge collections.
 terminology:
  source_chunk: Normalized source artifact
  entity: Durable concept, claim, method, person, place, work, or object
  relation: Typed link between two generated entities
 granularity:
  default: Extract entities that can stand alone as useful review artifacts.
 evaluation_criteria:
  - groundedness
  - usefulness
  - clarity
  - provenance
--- a/src/infospace_bench/profiles/general-knowledge/templates/evaluate-entity.md
+++ b/src/infospace_bench/profiles/general-knowledge/templates/evaluate-entity.md
@@ -0,0 +1,14 @@
 # Evaluate Entity
 Profile: {{ macros.profile }}
 Evaluate the generated entity as Markdown with YAML frontmatter. Include
 `artifact_id`, `evaluator`, `evaluated_at`, and scores for groundedness and
 usefulness on a 0 to 5 scale.
 Entity artifact: {{ input.artifact_id }}
 Entity title: {{ input.title }}
 ## Entity
 {{ input.content }}
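The template above uses dotted placeholders such as `{{ macros.profile }}` and `{{ input.title }}`. The rendering engine is not part of this diff, so the following is only a sketch of how such placeholders could be substituted from a nested mapping. `render` is a hypothetical function, and a missing key raises `KeyError` rather than rendering an empty string.

```python
import re
from typing import Any, Mapping

PLACEHOLDER_RE = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")


def render(template: str, context: Mapping[str, Any]) -> str:
    """Replace each {{ dotted.path }} with the value found in context."""

    def lookup(match: re.Match) -> str:
        value: Any = context
        for part in match.group(1).split("."):
            value = value[part]  # KeyError here signals an unknown placeholder
        return str(value)

    return PLACEHOLDER_RE.sub(lookup, template)


context = {
    "macros": {"profile": "general-knowledge"},
    "input": {"artifact_id": "entity/knowledge-artifact.md", "title": "Knowledge Artifact"},
}
rendered = render(
    "Profile: {{ macros.profile }}\nEntity title: {{ input.title }}",
    context,
)
```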
--- a/src/infospace_bench/profiles/general-knowledge/templates/extract-entities.md
+++ b/src/infospace_bench/profiles/general-knowledge/templates/extract-entities.md
@@ -0,0 +1,15 @@
 # Extract Entities
 Profile: {{ macros.profile }}
 Extract reusable infospace entities from the source chunk. Return one Markdown
 bundle where each entity starts with `# Entity Title` and contains at least a
 `## Definition` section. Prefer durable concepts, claims, named methods,
 people, places, works, and objects over sentence fragments.
 Source title: {{ input.title }}
 Source artifact: {{ input.artifact_id }}
 ## Source
 {{ input.content }}
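The template above asks for one Markdown bundle in which every entity starts with a `# Entity Title` heading, so a later stage has to split that bundle into individual entities. A minimal sketch of such a split follows. `split_entity_bundle` is a hypothetical name and is not necessarily how the `split_entities` stage is implemented.

```python
import re

# One '#' followed by whitespace and text; '##' section headings do not match.
ENTITY_HEADING_RE = re.compile(r"#[ \t]+(\S.*?)[ \t]*$")


def split_entity_bundle(bundle: str) -> list[tuple[str, str]]:
    """Split a bundle into (title, body) pairs at each top-level heading."""
    entities: list[tuple[str, str]] = []
    title = None
    body: list[str] = []
    for line in bundle.splitlines():
        match = ENTITY_HEADING_RE.match(line)
        if match:
            if title is not None:
                entities.append((title, "\n".join(body).strip()))
            title, body = match.group(1), []
        elif title is not None:
            body.append(line)
    if title is not None:
        entities.append((title, "\n".join(body).strip()))
    return entities


bundle = (
    "# Knowledge Artifact\n\n"
    "## Definition\n\n"
    "A durable unit of structured knowledge derived from a source.\n\n"
    "# Source Claim\n\n"
    "## Definition\n\n"
    "A claim preserved from the source for later review.\n"
)
entities = split_entity_bundle(bundle)
```

Text that precedes the first top-level heading is dropped in this sketch; a stricter splitter might report it as an error instead.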
--- a/src/infospace_bench/profiles/general-knowledge/templates/extract-relations.md
+++ b/src/infospace_bench/profiles/general-knowledge/templates/extract-relations.md
@@ -0,0 +1,14 @@
 # Extract Relations
 Profile: {{ macros.profile }}
 Extract a small set of important relations from the source chunk. Return one
 Markdown relation artifact with sections `## Subject`, `## Predicate`, and
 `## Object`. Use entity-style names for subject and object.
 Source title: {{ input.title }}
 Source artifact: {{ input.artifact_id }}
 ## Source
 {{ input.content }}
--- a/src/infospace_bench/profiles/general-knowledge/templates/summarize-source.md
+++ b/src/infospace_bench/profiles/general-knowledge/templates/summarize-source.md
@@ -0,0 +1,13 @@
 # Summarize Source Chunk
 Profile: {{ macros.profile }}
 Summarize the source chunk as Markdown. Preserve concrete claims, named concepts,
 and evidence phrases that should guide later entity and relation extraction.
 Source title: {{ input.title }}
 Source artifact: {{ input.artifact_id }}
 ## Source
 {{ input.content }}
--- a/src/infospace_bench/profiles/general-knowledge/templates/synthesize-report.md
+++ b/src/infospace_bench/profiles/general-knowledge/templates/synthesize-report.md
@@ -0,0 +1,6 @@
 # Synthesize Collection Report
 Profile: {{ macros.profile }}
 Synthesize a concise report from generated source summaries, entities,
 relations, evaluations, and collection metrics.
--- a/src/infospace_bench/source_intake.py
+++ b/src/infospace_bench/source_intake.py
@@ -0,0 +1,273 @@
 from __future__ import annotations
 import hashlib
 import html
 import re
 import zipfile
 from dataclasses import asdict, dataclass
 from datetime import datetime, timezone
 from pathlib import Path
 from typing import Iterable
 from .errors import InfospaceError
 from .semantics import slugify
 EXTRACTOR_VERSION = "generic-source-intake-v1"
 SUPPORTED_EXTENSIONS = {".md", ".markdown", ".txt", ".html", ".htm", ".epub"}
 HTML_TITLE_RE = re.compile(r"<title[^>]*>(?P<title>.*?)</title>", re.I | re.S)
 HTML_H1_RE = re.compile(r"<h1[^>]*>(?P<title>.*?)</h1>", re.I | re.S)
 SCRIPT_STYLE_RE = re.compile(r"<(script|style)[^>]*>.*?</\1>", re.I | re.S)
 TAG_RE = re.compile(r"<[^>]+>")
@dataclass(frozen=True)
 class SourceChunk:
    chunk_id: str
    title: str
    markdown: str
    source_type: str
    original_path: str
    digest: str
    chunk_index: int
    chunk_count: int
    imported_at: str
    extractor_version: str = EXTRACTOR_VERSION
    def to_dict(self) -> dict:
        return asdict(self)
@dataclass(frozen=True)
 class _SourceDocument:
    title: str
    markdown: str
    source_type: str
    original_path: str
    base_slug: str
 def normalize_source(
    source: str | Path,
    *,
    max_words: int = 800,
    max_chunks: int | None = None,
 ) -> list[SourceChunk]:
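    """Normalize a source file or folder into ordered SourceChunk records.

    Documents longer than ``max_words`` are split into word-window parts; a
    positive ``max_chunks`` stops intake once that many chunks exist.
    """
    source_path = Path(source)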
    if not source_path.exists():
        raise InfospaceError(
            "missing_source",
            f"Source path does not exist: {source_path}",
            {"source": str(source_path)},
        )
    documents = list(_iter_documents(source_path))
    if not documents:
        raise InfospaceError(
            "unsupported_source",
            f"No supported source documents found: {source_path}",
            {
                "source": str(source_path),
                "supported_extensions": sorted(SUPPORTED_EXTENSIONS),
            },
        )
    imported_at = datetime.now(timezone.utc).isoformat()
    chunks: list[SourceChunk] = []
    used_ids: set[str] = set()
    for document in documents:
        pieces = _chunk_markdown(document.markdown, max_words=max_words)
        for index, piece in enumerate(pieces):
            title = document.title if len(pieces) == 1 else f"{document.title} Part {index + 1}"
            base_id = (
                document.base_slug if len(pieces) == 1 else f"{document.base_slug}-part-{index + 1:03d}"
            )
            chunk_id = _dedupe_chunk_id(base_id, used_ids)
            chunks.append(
                SourceChunk(
                    chunk_id=chunk_id,
                    title=title,
                    markdown=piece,
                    source_type=document.source_type,
                    original_path=document.original_path,
                    digest=_digest_text(piece),
                    chunk_index=index,
                    chunk_count=len(pieces),
                    imported_at=imported_at,
                )
            )
            if max_chunks is not None and max_chunks > 0 and len(chunks) >= max_chunks:
                return chunks
    return chunks
 def _iter_documents(source_path: Path) -> Iterable[_SourceDocument]:
    if source_path.is_dir():
        for path in sorted(source_path.rglob("*")):
            if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS:
                yield from _iter_documents(path)
        return
    suffix = source_path.suffix.lower()
    if suffix in (".md", ".markdown"):
        yield _markdown_document(source_path)
    elif suffix == ".txt":
        yield _text_document(source_path)
    elif suffix in (".html", ".htm"):
        yield _html_document(source_path, source_type="html")
    elif suffix == ".epub":
        yield from _epub_documents(source_path)
 def _markdown_document(path: Path) -> _SourceDocument:
    markdown = _normalize_newlines(path.read_text(encoding="utf-8")).strip() + "\n"
    title = _markdown_title(markdown) or _title_from_path(path)
    return _SourceDocument(
        title=title,
        markdown=_ensure_h1(markdown, title),
        source_type="markdown",
        original_path=str(path),
        base_slug=slugify(title) or slugify(path.stem) or "source",
    )
 def _text_document(path: Path) -> _SourceDocument:
    title = _title_from_path(path)
    body = _normalize_newlines(path.read_text(encoding="utf-8")).strip()
    markdown = f"# {title}\n\n{body}\n"
    return _SourceDocument(
        title=title,
        markdown=markdown,
        source_type="text",
        original_path=str(path),
        base_slug=slugify(title) or "source",
    )
 def _html_document(
    path: Path,
    *,
    source_type: str,
    original_path: str | None = None,
    text: str | None = None,
 ) -> _SourceDocument:
    raw = text if text is not None else path.read_text(encoding="utf-8")
    title = _html_title(raw) or _title_from_path(path)
    body = _html_to_text(raw)
    if body.lower().startswith(title.lower()):
        body = body[len(title) :].strip()
    markdown = f"# {title}\n\n{body}\n"
    return _SourceDocument(
        title=title,
        markdown=markdown,
        source_type=source_type,
        original_path=original_path or str(path),
        base_slug=slugify(title) or slugify(path.stem) or "source",
    )
 def _epub_documents(path: Path) -> Iterable[_SourceDocument]:
    try:
        with zipfile.ZipFile(path) as archive:
            names = [
                name
                for name in sorted(archive.namelist())
                if Path(name).suffix.lower() in {".html", ".htm", ".xhtml", ".txt", ".md"}
                and not name.endswith("/")
            ]
            for name in names:
                raw = archive.read(name).decode("utf-8", errors="replace")
                pseudo_path = Path(name)
                if pseudo_path.suffix.lower() in {".txt", ".md"}:
                    title = _markdown_title(raw) or _title_from_path(pseudo_path)
                    markdown = _ensure_h1(_normalize_newlines(raw).strip() + "\n", title)
                    yield _SourceDocument(
                        title=title,
                        markdown=markdown,
                        source_type="epub",
                        original_path=f"{path}!{name}",
                        base_slug=slugify(title) or slugify(pseudo_path.stem) or "source",
                    )
                else:
                    yield _html_document(
                        pseudo_path,
                        source_type="epub",
                        original_path=f"{path}!{name}",
                        text=raw,
                    )
    except zipfile.BadZipFile as exc:
        raise InfospaceError(
            "invalid_epub_source",
            f"EPUB source is not a readable zip archive: {path}",
            {"source": str(path)},
        ) from exc
 def _chunk_markdown(markdown: str, *, max_words: int) -> list[str]:
    text = markdown.strip()
    if max_words <= 0:
        return [text + "\n"]
    words = text.split()
    if len(words) <= max_words:
        return [text + "\n"]
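    chunks: list[str] = []
    heading = _markdown_title(text) or "Source"
    body_words = re.sub(r"(?m)^# .+?\n+", "", text, count=1).split()
    if len(body_words) <= max_words:
        # The body alone fits once the heading is excluded; keep one unsplit chunk
        # so its heading matches the un-suffixed title assigned in normalize_source.
        return [text + "\n"]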
    for start in range(0, len(body_words), max_words):
        part = " ".join(body_words[start : start + max_words]).strip()
        chunks.append(f"# {heading} Part {len(chunks) + 1}\n\n{part}\n")
    return chunks
 def _html_title(raw: str) -> str:
    match = HTML_TITLE_RE.search(raw) or HTML_H1_RE.search(raw)
    if not match:
        return ""
    return _collapse_ws(_html_to_text(match.group("title")))
 def _html_to_text(raw: str) -> str:
    cleaned = SCRIPT_STYLE_RE.sub(" ", raw)
    cleaned = re.sub(r"</(p|div|section|article|h[1-6]|li)>", "\n", cleaned, flags=re.I)
    cleaned = TAG_RE.sub(" ", cleaned)
    cleaned = html.unescape(cleaned)
    lines = [_collapse_ws(line) for line in cleaned.splitlines()]
    return "\n\n".join(line for line in lines if line).strip()
 def _ensure_h1(markdown: str, title: str) -> str:
    if re.search(r"(?m)^#\s+\S", markdown):
        return markdown
    return f"# {title}\n\n{markdown.strip()}\n"
 def _markdown_title(markdown: str) -> str:
    match = re.search(r"(?m)^#\s+(?P<title>.+?)\s*$", markdown)
    return match.group("title").strip() if match else ""
 def _title_from_path(path: Path) -> str:
    words = re.sub(r"[^A-Za-z0-9]+", " ", path.stem).strip()
    return words.title() if words else "Source"
 def _dedupe_chunk_id(base_id: str, used_ids: set[str]) -> str:
    candidate = base_id or "source"
    if candidate not in used_ids:
        used_ids.add(candidate)
        return candidate
    index = 2
    while f"{candidate}-{index}" in used_ids:
        index += 1
    deduped = f"{candidate}-{index}"
    used_ids.add(deduped)
    return deduped
 def _digest_text(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
 def _collapse_ws(value: str) -> str:
    return re.sub(r"\s+", " ", value).strip()
 def _normalize_newlines(value: str) -> str:
    return value.replace("\r\n", "\n").replace("\r", "\n")
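The chunking in `_chunk_markdown` above is a fixed word window: the body is split on whitespace and regrouped `max_words` at a time, and each chunk is then hashed with SHA-256. The standalone sketch below shows the same idea with hypothetical helper names. As in the original, rejoining with single spaces discards paragraph breaks inside oversize documents.

```python
import hashlib


def chunk_words(body: str, max_words: int) -> list[str]:
    """Split body into consecutive windows of at most max_words words."""
    words = body.split()
    if max_words <= 0 or len(words) <= max_words:
        return [body.strip()]  # small or unbounded input stays in one piece
    return [
        " ".join(words[start : start + max_words])
        for start in range(0, len(words), max_words)
    ]


def digest(text: str) -> str:
    """Hex SHA-256 of the UTF-8 encoded text, stable across runs."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


parts = chunk_words("alpha beta gamma delta epsilon", max_words=2)
digests = [digest(part) for part in parts]
```

Because the digest depends only on chunk text, re-running intake on an unchanged source yields identical digests, which is what the first test in this commit asserts for the HTML article.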
--- a/src/infospace_bench/workflow.py
+++ b/src/infospace_bench/workflow.py
@@ -273,10 +273,12 @@ class WorkflowStageRecord:
    input_artifact_id: str
    output_artifact_id: str = ""
    message: str = ""
    provider: str = ""
    metadata: dict[str, Any] = field(default_factory=dict)
    def to_dict(self) -> dict[str, Any]:
        data = asdict(self)
-        return {key: value for key, value in data.items() if value != ""}
+        return {key: value for key, value in data.items() if value not in ("", {}, [])}
@dataclass(frozen=True)
@@ -442,6 +444,7 @@ def _execute_workflow(
                    infospace.root,
                    dry_run=False,
                    provider=result.provider,
                    provider_metadata=result.metadata,
                )
                outputs.append(output)
                stage_outputs[stage.id] = {
@@ -458,6 +461,8 @@ def _execute_workflow(
                        status="completed",
                        input_artifact_id=input_record.artifact_id,
                        output_artifact_id=output.artifact_id,
                        provider=result.provider,
                        metadata=result.metadata,
                    )
                )
            elif stage.kind == "split_entities":
@@ -645,6 +650,7 @@ def _resolve_output(
    *,
    dry_run: bool,
    provider: str = "",
    provider_metadata: dict[str, Any] | None = None,
 ) -> WorkflowOutputRecord:
    if stage.output is None:
        raise InfospaceError(
@@ -673,6 +679,11 @@ def _resolve_output(
                    "stage_id": stage.id,
                    "input_artifact_id": input_record.artifact_id,
                    **({"provider": provider} if provider else {}),
                    **(
                        {"provider_metadata": provider_metadata}
                        if provider_metadata
                        else {}
                    ),
                },
                relationships=[
                    {
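The `to_dict` change in this file drops empty dictionaries and lists as well as empty strings. The sketch below reproduces that filter on a trimmed, hypothetical stand-in for `WorkflowStageRecord` to show what is and is not omitted. The `retry_count` field is an assumption added purely to illustrate that `0` survives the filter.

```python
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass(frozen=True)
class StageRecord:  # hypothetical stand-in, not the real WorkflowStageRecord
    stage_id: str
    status: str
    provider: str = ""
    metadata: dict[str, Any] = field(default_factory=dict)
    retry_count: int = 0  # illustrative field only

    def to_dict(self) -> dict[str, Any]:
        data = asdict(self)
        # Membership uses ==, so "", {} and [] are dropped while 0 and False stay.
        return {key: value for key, value in data.items() if value not in ("", {}, [])}


bare = StageRecord("extract-entities", "completed").to_dict()
full = StageRecord(
    "extract-entities",
    "completed",
    provider="openrouter",
    metadata={"model": "openai/gpt-4o-mini"},
).to_dict()
```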
--- a/tests/test_generic_generator.py
+++ b/tests/test_generic_generator.py
@@ -0,0 +1,301 @@
 import json
 import os
 import subprocess
 import sys
 import zipfile
 from pathlib import Path
 import yaml
 from infospace_bench.generator import (
    init_generation_infospace,
    run_generation,
    status_generation,
 )
 from infospace_bench.openrouter import OpenRouterAssistedGenerationAdapter
 from infospace_bench.source_intake import normalize_source
 def cli_env() -> dict[str, str]:
    env = os.environ.copy()
    # NOTE: the second entry is a machine-specific markitect-tool checkout.
    env["PYTHONPATH"] = os.pathsep.join(["src", "/home/worsch/markitect-tool/src"])
    return env
 def fixture_responses(path: Path) -> None:
    data = {
        "responses": [
            {
                "stage_id": "summarize-source",
                "input_artifact_id": "*",
                "markdown": "# Source Summary\n\nThe source describes reusable knowledge work.\n",
            },
            {
                "stage_id": "extract-entities",
                "input_artifact_id": "*",
                "markdown": (
                    "# Knowledge Artifact\n\n"
                    "## Definition\n\n"
                    "A durable unit of structured knowledge derived from a source.\n\n"
                    "## Context\n\n"
                    "Generated from a generic source workflow.\n\n"
                    "# Source Claim\n\n"
                    "## Definition\n\n"
                    "A claim preserved from the source for later review.\n\n"
                    "## Context\n\n"
                    "Used to keep provenance visible.\n"
                ),
            },
            {
                "stage_id": "extract-relations",
                "input_artifact_id": "*",
                "markdown": (
                    "# Knowledge Artifact Supports Source Claim\n\n"
                    "## Subject\n\n"
                    "Knowledge Artifact\n\n"
                    "## Predicate\n\n"
                    "supports\n\n"
                    "## Object\n\n"
                    "Source Claim\n\n"
                    "## Relation Type\n\n"
                    "support\n\n"
                    "## Evidence\n\n"
                    "The source links durable artifacts to explicit claims.\n"
                ),
            },
            {
                "stage_id": "evaluate-entity",
                "input_artifact_id": "*",
                "markdown": (
                    "---\n"
                    "artifact_id: entity/knowledge-artifact.md\n"
                    "evaluator: fixture\n"
                    "evaluated_at: '2026-05-14T00:00:00'\n"
                    "scores:\n"
                    "  - name: groundedness\n"
                    "    value: 4.0\n"
                    "    max_value: 5.0\n"
                    "  - name: usefulness\n"
                    "    value: 4.0\n"
                    "    max_value: 5.0\n"
                    "---\n"
                    "\n"
                    "# Evaluation: entity/knowledge-artifact.md\n"
                ),
            },
        ]
    }
    path.write_text(yaml.safe_dump(data, sort_keys=False), encoding="utf-8")
 def write_epub_fixture(path: Path) -> None:
    with zipfile.ZipFile(path, "w") as archive:
        archive.writestr("OEBPS/chapter1.xhtml", "<h1>Chapter One</h1><p>Alpha beta.</p>")
        archive.writestr("OEBPS/chapter2.xhtml", "<h1>Chapter Two</h1><p>Gamma delta.</p>")
 def test_source_intake_accepts_article_ebook_and_folder(tmp_path: Path) -> None:
    article = tmp_path / "article.html"
    article.write_text(
        "<html><head><title>Article Title</title></head>"
        "<body><h1>Article Title</h1><p>One two three.</p></body></html>",
        encoding="utf-8",
    )
    ebook = tmp_path / "book.epub"
    write_epub_fixture(ebook)
    folder = tmp_path / "collection"
    folder.mkdir()
    (folder / "note.md").write_text("# Note\n\nMarkdown source.", encoding="utf-8")
    (folder / "memo.txt").write_text("Plain text source.", encoding="utf-8")
    article_chunks = normalize_source(article)
    ebook_chunks = normalize_source(ebook)
    folder_chunks = normalize_source(folder)
    assert article_chunks[0].source_type == "html"
    assert article_chunks[0].title == "Article Title"
    assert article_chunks[0].chunk_id == "article-title"
    assert article_chunks[0].digest == normalize_source(article)[0].digest
    assert [chunk.source_type for chunk in ebook_chunks] == ["epub", "epub"]
    assert {chunk.source_type for chunk in folder_chunks} == {"markdown", "text"}
    assert all(chunk.markdown.startswith("# ") for chunk in folder_chunks)
 def test_generate_from_source_cli_fixture_builds_infospace(tmp_path: Path) -> None:
    source = tmp_path / "article.md"
    source.write_text(
        "# Reusable Knowledge\n\nA source about claims and durable artifacts.",
        encoding="utf-8",
    )
    fixture = tmp_path / "responses.yaml"
    fixture_responses(fixture)
    result = subprocess.run(
        [
            sys.executable,
            "-m",
            "infospace_bench",
            "generate",
            "from-source",
            str(source),
            "--workspace",
            str(tmp_path),
            "--slug",
            "article-space",
            "--name",
            "Article Space",
            "--fixture-responses",
            str(fixture),
            "--apply",
        ],
        check=False,
        env=cli_env(),
        text=True,
        capture_output=True,
    )
    assert result.returncode == 0, result.stderr
    payload = json.loads(result.stdout)
    root = Path(payload["root"])
    status = subprocess.run(
        [
            sys.executable,
            "-m",
            "infospace_bench",
            "generate",
            "status",
            str(root),
        ],
        check=False,
        env=cli_env(),
        text=True,
        capture_output=True,
    )
    assert status.returncode == 0, status.stderr
    status_payload = json.loads(status.stdout)
    assert payload["status"] == "completed"
    assert (root / "artifacts" / "sources" / "reusable-knowledge.md").is_file()
    assert (root / "artifacts" / "entities" / "knowledge-artifact.md").is_file()
    assert (root / "artifacts" / "relations" / "reusable-knowledge-relations.md").is_file()
    assert (root / "output" / "metrics" / "metrics.yaml").is_file()
    assert status_payload["source_chunk_count"] == 1
    assert status_payload["entity_count"] == 2
    assert status_payload["relation_count"] == 1
    assert status_payload["stale"] is False
 def test_generate_from_ebook_and_folder_fixtures(tmp_path: Path) -> None:
    fixture = tmp_path / "responses.yaml"
    fixture_responses(fixture)
    ebook = tmp_path / "book.epub"
    write_epub_fixture(ebook)
    folder = tmp_path / "folder"
    folder.mkdir()
    (folder / "first.md").write_text("# First\n\nOne source.", encoding="utf-8")
    (folder / "second.txt").write_text("Second source.", encoding="utf-8")
    for source, slug, expected_sources in (
        (ebook, "book-space", 2),
        (folder, "folder-space", 2),
    ):
        result = subprocess.run(
            [
                sys.executable,
                "-m",
                "infospace_bench",
                "generate",
                "from-source",
                str(source),
                "--workspace",
                str(tmp_path),
                "--slug",
                slug,
                "--name",
                slug.replace("-", " ").title(),
                "--fixture-responses",
                str(fixture),
                "--apply",
            ],
            check=False,
            env=cli_env(),
            text=True,
            capture_output=True,
        )
        assert result.returncode == 0, result.stderr
        payload = json.loads(result.stdout)
        status = status_generation(Path(payload["root"]))
        assert status["source_chunk_count"] == expected_sources
        assert status["entity_count"] == 2
        assert status["relation_count"] == expected_sources
        assert status["history_snapshot_count"] == 1
 def test_generator_resume_is_idempotent_and_detects_stale_source(tmp_path: Path) -> None:
    source = tmp_path / "note.md"
    source.write_text("# Note\n\nInitial source.", encoding="utf-8")
    fixture = tmp_path / "responses.yaml"
    fixture_responses(fixture)
    root = init_generation_infospace(tmp_path, source, "note-space", name="Note Space").root
    first = run_generation(root, fixture_responses=fixture)
    second = run_generation(root, fixture_responses=fixture, resume=True)
    generated_source = root / "artifacts" / "sources" / "note.md"
    generated_source.write_text("# Note\n\nChanged source.", encoding="utf-8")
    stale_status = status_generation(root)
    assert first.status == "completed"
    assert second.status == "skipped"
    assert second.skipped is True
    assert stale_status["stale"] is True
    assert stale_status["stale_sources"] == ["source/note.md"]
 def test_openrouter_adapter_uses_model_and_records_metadata() -> None:
    requests: list[dict] = []
    def transport(payload: dict, headers: dict[str, str], endpoint: str) -> dict:
        requests.append({"payload": payload, "headers": headers, "endpoint": endpoint})
        return {
            "id": "or-request-1",
            "choices": [{"message": {"content": "# Generated\n\nContent."}}],
            "usage": {"prompt_tokens": 5, "completion_tokens": 3},
        }
    adapter = OpenRouterAssistedGenerationAdapter(
        api_key="test-key",
        model="openai/gpt-4o-mini",
        transport=transport,
        retry_limit=0,
    )
    result = adapter.generate(
        type(
            "Request",
            (),
            {
                "prompt": "Generate markdown.",
                "stage_id": "extract-entities",
                "workflow_id": "generic-source-extract",
                "input_artifact_id": "source/example.md",
                "provider_hint": "openrouter",
                "metadata": {},
            },
        )()
    )
    assert requests[0]["payload"]["model"] == "openai/gpt-4o-mini"
    assert requests[0]["headers"]["Authorization"] == "Bearer test-key"
    assert result.markdown == "# Generated\n\nContent."
    assert result.provider == "openrouter"
    assert result.metadata["model"] == "openai/gpt-4o-mini"
    assert result.metadata["request_id"] == "or-request-1"
    assert result.metadata["usage"]["completion_tokens"] == 3
 def test_generic_generator_docs_cover_openrouter_resume_and_cost_caps() -> None:
    text = Path("docs/generic-source-generator.md").read_text(encoding="utf-8")
    assert "OPENROUTER_API_KEY" in text
    assert "--model" in text
    assert "--max-chunks" in text
    assert "resume" in text.lower()
    assert "fixture-responses" in text
--- a/workplans/IB-WP-0015-generic-source-infospace-generator-cli.md
+++ b/workplans/IB-WP-0015-generic-source-infospace-generator-cli.md
@@ -4,7 +4,7 @@ type: workplan
 title: "Generic Source Infospace Generator CLI"
 domain: markitect
 repo: infospace-bench
-status: planned
+status: completed
 owner: markitect
 topic_slug: markitect
 created: "2026-05-14"
@@ -105,7 +105,7 @@ Default-safe modes:
 ```task
 id: IB-WP-0015-T01
-status: in_progress
+status: done
 priority: high
 state_hub_task_id: "08196bf2-9323-4cd8-860c-4306c965ed60"
 ```
@@ -128,7 +128,7 @@ state_hub_task_id: "08196bf2-9323-4cd8-860c-4306c965ed60"
 ```task
 id: IB-WP-0015-T02
-status: in_progress
+status: done
 priority: high
 state_hub_task_id: "5604796b-cb09-43ed-b3a9-5d4906790807"
 ```
@@ -152,7 +152,7 @@ state_hub_task_id: "5604796b-cb09-43ed-b3a9-5d4906790807"
 ```task
 id: IB-WP-0015-T03
-status: in_progress
+status: done
 priority: high
 state_hub_task_id: "c02720c5-1b82-458a-bf8c-9147af4fd9e9"
 ```
@@ -171,7 +171,7 @@ state_hub_task_id: "c02720c5-1b82-458a-bf8c-9147af4fd9e9"
 ```task
 id: IB-WP-0015-T04
-status: todo
+status: done
 priority: high
 state_hub_task_id: "21b50fbc-f43e-4b18-b012-976a5241f52a"
 ```
@@ -192,7 +192,7 @@ state_hub_task_id: "21b50fbc-f43e-4b18-b012-976a5241f52a"
 ```task
 id: IB-WP-0015-T05
-status: todo
+status: done
 priority: high
 state_hub_task_id: "ad882b6e-924e-4f9a-8e93-119aeadd8132"
 ```
@@ -216,7 +216,7 @@ state_hub_task_id: "ad882b6e-924e-4f9a-8e93-119aeadd8132"
 ```task
 id: IB-WP-0015-T06
-status: todo
+status: done
 priority: medium
 state_hub_task_id: "3461eacf-e42a-455c-954c-849b0ad69fc1"
 ```
@@ -264,3 +264,18 @@ state_hub_task_id: "3461eacf-e42a-455c-954c-849b0ad69fc1"
  - `infospace-bench`: applied infospace generation workflow and CLI
  - `kontextual-engine`: durable runtime/retrieval/audit if needed later
 ## Implementation Notes
 Completed on 2026-05-14.
 - Added generic source intake for Markdown, plain text, local HTML, EPUB-like
  archives, and folder collections.
 - Added the `general-knowledge` profile with prompt templates and contracts.
 - Added an explicit OpenRouter assisted-generation adapter with mocked provider
  tests and environment-based credential lookup.
 - Added `infospace-bench generate` subcommands for init, plan, run, resume,
  status, and from-source flows.
 - Added generation state, resume skipping, source/profile stale detection,
  metrics/history recording, and a manifest-backed generation report.
 - Added deterministic acceptance tests for article, ebook-like, and folder
  generation using fixture responses.
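The notes above mention source stale detection. The generation state format is not shown in this diff, so the following is only a sketch of one way to detect stale sources, assuming a recorded mapping of artifact id to SHA-256 digest. `stale_sources` and both mappings are hypothetical.

```python
import hashlib


def digest_text(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def stale_sources(recorded: dict[str, str], current: dict[str, str]) -> list[str]:
    """Return ids whose current text no longer matches the recorded digest.

    recorded maps artifact id -> digest captured at generation time;
    current maps artifact id -> text as it exists now.
    """
    return sorted(
        artifact_id
        for artifact_id, text in current.items()
        if recorded.get(artifact_id) != digest_text(text)
    )


recorded = {"source/note.md": digest_text("# Note\n\nInitial source.")}
unchanged = stale_sources(recorded, {"source/note.md": "# Note\n\nInitial source."})
changed = stale_sources(recorded, {"source/note.md": "# Note\n\nChanged source."})
```

An id that has no recorded digest is also reported as stale in this sketch, since `recorded.get` returns `None` for it.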