one-time bootstrap path

2026-05-02 00:36:00 +02:00
parent 911ca45618
commit 76f5ecb1b4
12 changed files with 328 additions and 27 deletions
--- a/docs/characteristic-evidence-model.md
+++ b/docs/characteristic-evidence-model.md
@@ -30,6 +30,33 @@ organized under the wrong capability.
 Observed facts are deterministic scanner output. They describe what was seen in
 the repository: files, languages, frameworks, routes, tests, documentation,
 provider names, configuration variables, and similar source-linked observations.
+Facts can carry a source role so generation can separate product evidence from
+ambient context. Important roles include:
+
+- `intent_summary`: `INTENT.md` or equivalent design-intent material describing
+  why the repository should exist and what utility it is meant to provide.
+- `derived_scope`: `SCOPE.md` or equivalent current-scope material. This is a
+  derived or curated description of what is believed to be true now, not primary
+  evidence for rebuilding the same characteristic model.
+- `product_documentation`: README, docs, specifications, and user-facing guides.
+- `implementation_source`: source code owned by the repository.
+- `dependency_declaration`: manifests, imports, lockfiles, and package metadata.
+- `configuration`, `ci_tooling`, `test_evidence`, and `agent_guidance`.
+
+`INTENT.md` and `SCOPE.md` deliberately answer different questions. Intent is a
+design artifact: what the repository is supposed to become or provide. Scope is
+a derived current-state artifact: what the repository is understood to provide
+after evidence and review. A good `SCOPE.md` is valuable context, but using it
+as ordinary evidence for generated characteristics creates a circular model.
+Rebuilds should therefore prefer `INTENT.md`, product documentation, source, and
+tests; `SCOPE.md` should be used as comparison material or explicit bootstrap
+input only when a curator chooses that mode.
+
+For repositories that already have a useful `SCOPE.md` but no `INTENT.md`,
+repo-scoping can perform a one-time bootstrap by copying the scope text into a
+new intent file with a clear provenance note. After that bootstrap, the files
+should diverge naturally: `INTENT.md` remains design intent, while `SCOPE.md`
+remains generated or curated current scope.

 Source references point from interpreted claims back to files or facts.
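
A minimal sketch of how the rebuild preference described in this hunk could be applied when selecting evidence. The `Fact` class, `REBUILD_EVIDENCE_ROLES`, and `rebuild_evidence` are hypothetical names introduced for illustration; they are not part of this commit, and the exact set of roles counted as primary evidence is an assumption drawn from the prose above.

```python
from dataclasses import dataclass, field

# Roles treated as primary rebuild evidence in this sketch (assumption based
# on the documentation text, not on the project's actual implementation).
REBUILD_EVIDENCE_ROLES = {
    "intent_summary",
    "product_documentation",
    "implementation_source",
    "test_evidence",
}


@dataclass
class Fact:
    """Hypothetical stand-in for an observed fact; not the project's ObservedFact."""

    kind: str
    path: str
    metadata: dict = field(default_factory=dict)


def rebuild_evidence(facts, *, bootstrap_from_scope=False):
    """Return facts usable as evidence; derived scope only in explicit bootstrap mode."""
    allowed = set(REBUILD_EVIDENCE_ROLES)
    if bootstrap_from_scope:
        allowed.add("derived_scope")
    return [f for f in facts if f.metadata.get("source_role") in allowed]


facts = [
    Fact("intent", "INTENT.md", {"source_role": "intent_summary"}),
    Fact("scope", "SCOPE.md", {"source_role": "derived_scope"}),
    Fact("documentation", "README.md", {"source_role": "product_documentation"}),
]
default_paths = [f.path for f in rebuild_evidence(facts)]
bootstrap_paths = [f.path for f in rebuild_evidence(facts, bootstrap_from_scope=True)]
```

Keeping `derived_scope` behind an explicit flag mirrors the "only when a curator chooses that mode" wording, so the circular-evidence case cannot happen by default.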

--- a/docs/terminology.md
+++ b/docs/terminology.md
@@ -42,6 +42,20 @@ normalization.
  facts or to lower-level characteristics.
 - Observed fact: deterministic scanner output such as files, manifests,
  languages, tests, APIs, routes, commands, or documentation references.
+- Intent: a design-time statement of expected repository utility. `INTENT.md`
+  is the preferred file for this. It can guide candidate generation because it
+  describes why the repository should exist.
+- Derived scope: a current-state statement of what the repository is understood
+  to provide. `SCOPE.md` is the preferred file for this. It is generated or
+  curated from evidence and approved characteristics, so it should not be used
+  as ordinary evidence for rebuilding those same characteristics.
+- Intent bootstrap: a one-time migration that creates `INTENT.md` from an
+  existing `SCOPE.md` when no intent file exists. The generated file carries a
+  provenance note and should be reviewed as design intent.
+- Source role: provenance metadata on a fact or content chunk, such as
+  `intent_summary`, `derived_scope`, `product_documentation`,
+  `implementation_source`, `dependency_declaration`, `configuration`,
+  `ci_tooling`, `test_evidence`, or `agent_guidance`.
 - Candidate: proposed characteristic or evidence from deterministic heuristics
  or optional LLM assistance. Candidates are review inputs, not registry truth.
 - Approved: curated registry truth that appears in ability maps, search, exports,
--- a/src/repo_registry/candidate_graph/generator.py
+++ b/src/repo_registry/candidate_graph/generator.py
@@ -63,8 +63,7 @@ class CandidateGraphGenerator:
            return []
        chunks = chunks or []

-        scope_docs = self._facts(facts, "scope")
-        docs = scope_docs + self._facts(facts, "documentation")
+        docs = self._facts(facts, "intent") + self._facts(facts, "documentation")
        tests = self._facts(facts, "test")
        examples = self._facts(facts, "example")
        interfaces = self._facts(facts, "interface")
@@ -662,7 +661,7 @@ class CandidateGraphGenerator:

    def _document_purpose_sentence(self, chunks: list[ContentChunk]) -> str:
        for chunk in self._documentation_chunks(chunks):
-            if chunk.kind not in {"scope", "documentation"}:
+            if chunk.kind not in {"intent", "documentation"}:
                continue
            lines = [line.strip() for line in chunk.text.splitlines() if line.strip()]
            paragraph = next((line for line in lines if not line.startswith("#")), "")
@@ -745,8 +744,8 @@ class CandidateGraphGenerator:

    def _documentation_chunks(self, chunks: list[ContentChunk]) -> list[ContentChunk]:
        return sorted(
-            [chunk for chunk in chunks if chunk.kind in {"scope", "documentation"}],
-            key=lambda chunk: (0 if chunk.kind == "scope" else 1, chunk.path, chunk.start_line),
+            [chunk for chunk in chunks if chunk.kind in {"intent", "documentation"}],
+            key=lambda chunk: (0 if chunk.kind == "intent" else 1, chunk.path, chunk.start_line),
        )

    def _interface_summary(self, chunks: list[ContentChunk]) -> str:
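
The ordering change in `_documentation_chunks` can be illustrated in isolation. The sketch below assumes a simplified `Chunk` class with only the three fields the sort key reads; it is a hypothetical stand-in for `ContentChunk`, and `documentation_chunks` is a free-function restatement of the method rather than the project's code.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for ContentChunk with only the fields the sort uses."""

    kind: str
    path: str
    start_line: int


def documentation_chunks(chunks):
    """Keep intent and documentation chunks, ordering intent chunks first."""
    eligible = [c for c in chunks if c.kind in {"intent", "documentation"}]
    return sorted(
        eligible,
        key=lambda c: (0 if c.kind == "intent" else 1, c.path, c.start_line),
    )


chunks = [
    Chunk("scope", "SCOPE.md", 1),
    Chunk("documentation", "README.md", 1),
    Chunk("intent", "INTENT.md", 5),
    Chunk("intent", "INTENT.md", 1),
]
ordered = [(c.path, c.start_line) for c in documentation_chunks(chunks)]
```

With this key, `scope` chunks are filtered out entirely rather than demoted, which matches the test added in `tests/test_candidate_graph.py` where the `SCOPE.md` sentence must not influence the generated name.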
--- a/src/repo_registry/content_indexing/extractor.py
+++ b/src/repo_registry/content_indexing/extractor.py
@@ -7,6 +7,7 @@ from repo_registry.core.models import ObservedFact


 INDEXED_FACT_KINDS = {
+    "intent",
    "scope",
    "documentation",
    "example",
--- a/src/repo_registry/intent/__init__.py
+++ b/src/repo_registry/intent/__init__.py
@@ -0,0 +1 @@
+"""Intent-file helpers for repository scoping."""
--- a/src/repo_registry/intent/bootstrap.py
+++ b/src/repo_registry/intent/bootstrap.py
@@ -0,0 +1,130 @@
+from __future__ import annotations
+
+import argparse
+from dataclasses import dataclass
+from datetime import date
+from pathlib import Path
+from typing import Iterable
+
+
+BOOTSTRAP_NOTE = (
+    "> Bootstrapped from `SCOPE.md` by repo-scoping.\n"
+    "> Review and edit this file as design intent. `SCOPE.md` remains the\n"
+    "> derived current-scope artifact."
+)
+
+
+@dataclass(frozen=True)
+class IntentBootstrapResult:
+    repo_path: str
+    scope_path: str
+    intent_path: str
+    status: str
+    message: str
+
+
+def bootstrap_intent_from_scope(
+    repo_path: str | Path,
+    *,
+    dry_run: bool = False,
+    overwrite: bool = False,
+    today: date | None = None,
+) -> IntentBootstrapResult:
+    root = Path(repo_path).expanduser().resolve()
+    scope_path = root / "SCOPE.md"
+    intent_path = root / "INTENT.md"
+
+    if not root.is_dir():
+        return _result(root, scope_path, intent_path, "missing_repo", "repository path does not exist")
+    if not scope_path.is_file():
+        return _result(root, scope_path, intent_path, "missing_scope", "SCOPE.md is not present")
+    if intent_path.exists() and not overwrite:
+        return _result(root, scope_path, intent_path, "exists", "INTENT.md already exists")
+
+    status = "would_overwrite" if intent_path.exists() else "would_create"
+    if dry_run:
+        return _result(root, scope_path, intent_path, status, f"{status} INTENT.md from SCOPE.md")
+
+    intent_text = scope_to_intent_text(
+        scope_path.read_text(encoding="utf-8"),
+        today=today,
+    )
+    intent_path.write_text(intent_text, encoding="utf-8")
+    created_status = "overwritten" if status == "would_overwrite" else "created"
+    return _result(root, scope_path, intent_path, created_status, f"{created_status} INTENT.md from SCOPE.md")
+
+
+def bootstrap_many(
+    repo_paths: Iterable[str | Path],
+    *,
+    dry_run: bool = False,
+    overwrite: bool = False,
+    today: date | None = None,
+) -> list[IntentBootstrapResult]:
+    return [
+        bootstrap_intent_from_scope(
+            repo_path,
+            dry_run=dry_run,
+            overwrite=overwrite,
+            today=today,
+        )
+        for repo_path in repo_paths
+    ]
+
+
+def scope_to_intent_text(scope_text: str, *, today: date | None = None) -> str:
+    current_date = today or date.today()
+    lines = scope_text.splitlines()
+    while lines and not lines[0].strip():
+        lines.pop(0)
+
+    if lines and lines[0].lstrip().lower().startswith("# scope"):
+        lines[0] = "# INTENT"
+    elif not lines or not lines[0].startswith("#"):
+        lines.insert(0, "# INTENT")
+
+    note = f"{BOOTSTRAP_NOTE}\n> Bootstrap date: {current_date.isoformat()}"
+    insert_at = 1 if lines else 0
+    while insert_at < len(lines) and not lines[insert_at].strip():
+        insert_at += 1
+    lines[insert_at:insert_at] = ["", note, ""]
+    return "\n".join(lines).rstrip() + "\n"
+
+
+def _result(
+    root: Path,
+    scope_path: Path,
+    intent_path: Path,
+    status: str,
+    message: str,
+) -> IntentBootstrapResult:
+    return IntentBootstrapResult(
+        repo_path=str(root),
+        scope_path=str(scope_path),
+        intent_path=str(intent_path),
+        status=status,
+        message=message,
+    )
+
+
+def main(argv: list[str] | None = None) -> int:
+    parser = argparse.ArgumentParser(
+        description="Bootstrap INTENT.md from SCOPE.md for repositories that do not have intent files yet."
+    )
+    parser.add_argument("repo_paths", nargs="+", help="Repository checkout path(s) to inspect")
+    parser.add_argument("--dry-run", action="store_true", help="Report planned writes without writing files")
+    parser.add_argument("--overwrite", action="store_true", help="Overwrite existing INTENT.md files")
+    args = parser.parse_args(argv)
+
+    results = bootstrap_many(
+        args.repo_paths,
+        dry_run=args.dry_run,
+        overwrite=args.overwrite,
+    )
+    for result in results:
+        print(f"{result.status}\t{result.repo_path}\t{result.message}")
+    return 1 if any(result.status in {"missing_repo", "missing_scope"} for result in results) else 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
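
The status values returned by `bootstrap_intent_from_scope` follow a fixed branch order. The sketch below restates that order as a pure function so the outcomes can be checked without touching the filesystem; `plan_status` and its boolean parameters are hypothetical and exist only for this illustration.

```python
def plan_status(*, repo_exists, scope_exists, intent_exists, overwrite=False, dry_run=False):
    """Mirror the branch order of bootstrap_intent_from_scope as a decision table."""
    if not repo_exists:
        return "missing_repo"
    if not scope_exists:
        return "missing_scope"
    if intent_exists and not overwrite:
        # An existing INTENT.md is never replaced unless overwrite is requested.
        return "exists"
    if dry_run:
        return "would_overwrite" if intent_exists else "would_create"
    return "overwritten" if intent_exists else "created"


statuses = [
    plan_status(repo_exists=True, scope_exists=True, intent_exists=False),
    plan_status(repo_exists=True, scope_exists=True, intent_exists=True),
    plan_status(repo_exists=True, scope_exists=True, intent_exists=True, overwrite=True),
    plan_status(repo_exists=True, scope_exists=True, intent_exists=False, dry_run=True),
    plan_status(repo_exists=True, scope_exists=False, intent_exists=False),
]
```

Note that in `main` only `missing_repo` and `missing_scope` produce a non-zero exit code; an `exists` result is reported but still exits with 0.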
--- a/src/repo_registry/repo_scanning/scanner.py
+++ b/src/repo_registry/repo_scanning/scanner.py
@@ -180,13 +180,22 @@ class DeterministicScanner:
            name = path.name.lower()
            source_role = self._source_role(relative)

-            if name == "scope.md":
+            if name == "intent.md":
+                facts.append(
+                    FactCandidate(
+                        "intent",
+                        "INTENT",
+                        relative,
+                        metadata={"source_role": "intent_summary"},
+                    )
+                )
+            elif name == "scope.md":
                facts.append(
                    FactCandidate(
                        "scope",
                        "SCOPE",
                        relative,
-                        metadata={"source_role": "scope_summary"},
+                        metadata={"source_role": "derived_scope"},
                    )
                )
            elif name.startswith("readme"):
@@ -429,8 +438,10 @@ class DeterministicScanner:
        lower = relative_path.lower()
        parts = lower.split("/")
        name = parts[-1]
+        if name == "intent.md":
+            return "intent_summary"
        if name == "scope.md":
-            return "scope_summary"
+            return "derived_scope"
        if name in AGENT_GUIDANCE_FILES or any(part in AGENT_GUIDANCE_DIRS for part in parts):
            return "agent_guidance"
        if lower.startswith((".github/workflows/", ".gitea/workflows/")):
--- a/tests/test_candidate_graph.py
+++ b/tests/test_candidate_graph.py
@@ -108,6 +108,51 @@ def test_candidate_generator_enriches_descriptions_from_content_chunks():
    assert '@app.post("/classify")' in graph[0].capabilities[0].description


+def test_candidate_generator_prefers_intent_over_derived_scope_chunks():
+    repository = Repository(
+        id=1,
+        name="KeyCape",
+        url="/tmp/key-cape",
+        description=None,
+        branch="main",
+        status="analyzed",
+    )
+    facts = [
+        fact(1, "intent", "INTENT", "INTENT.md"),
+        fact(2, "scope", "SCOPE", "SCOPE.md"),
+        fact(3, "documentation", "README", "README.md"),
+    ]
+    chunks = [
+        chunk(
+            1,
+            "scope",
+            "SCOPE.md",
+            "# SCOPE\nAlready provides deployed IAM runtime behavior.",
+            end_line=2,
+        ),
+        chunk(
+            2,
+            "intent",
+            "INTENT.md",
+            "# INTENT\nDesign a lightweight IAM profile implementation.",
+            end_line=2,
+        ),
+        chunk(
+            3,
+            "documentation",
+            "README.md",
+            "# KeyCape\nREADME fallback should not beat intent.",
+            end_line=2,
+        ),
+    ]
+
+    graph = CandidateGraphGenerator().generate(repository, facts, chunks)
+
+    assert graph[0].name == "Design A Lightweight IAM Profile Implementation"
+    assert "INTENT. Design a lightweight IAM profile implementation" in graph[0].description
+    assert graph[0].source_refs[0].path == "INTENT.md"
+
+
 def test_candidate_confidence_scoring_stays_conservative_for_weak_facts():
    repository = Repository(
        id=1,
--- a/tests/test_content_indexing.py
+++ b/tests/test_content_indexing.py
@@ -86,18 +86,18 @@ def test_content_extractor_chunks_provider_related_config(tmp_path):
    assert "OPENROUTER_API_KEY" in chunks[0].text


-def test_content_extractor_preserves_source_role_metadata(tmp_path):
+def test_content_extractor_preserves_intent_source_role_metadata(tmp_path):
    repo = tmp_path / "repo"
    repo.mkdir()
-    (repo / "SCOPE.md").write_text("# SCOPE\n\nProvides OIDC.\n", encoding="utf-8")
+    (repo / "INTENT.md").write_text("# INTENT\n\nProvide OIDC.\n", encoding="utf-8")

    chunks = ContentExtractor().extract(
        repo,
        [
-            fact(1, "scope", "SCOPE", "SCOPE.md", source_role="scope_summary"),
+            fact(1, "intent", "INTENT", "INTENT.md", source_role="intent_summary"),
        ],
    )

    assert len(chunks) == 1
-    assert chunks[0].kind == "scope"
-    assert chunks[0].metadata["source_role"] == "scope_summary"
+    assert chunks[0].kind == "intent"
+    assert chunks[0].metadata["source_role"] == "intent_summary"
--- a/tests/test_intent_bootstrap.py
+++ b/tests/test_intent_bootstrap.py
@@ -0,0 +1,51 @@
+from datetime import date
+
+from repo_registry.intent.bootstrap import bootstrap_intent_from_scope, scope_to_intent_text
+
+
+def test_scope_to_intent_text_replaces_scope_heading_and_marks_bootstrap():
+    text = scope_to_intent_text(
+        "# SCOPE.md - Demo\n\n## One-liner\n\nCurrent utility.\n",
+        today=date(2026, 5, 2),
+    )
+
+    assert text.startswith("# INTENT\n\n")
+    assert "Bootstrapped from `SCOPE.md`" in text
+    assert "Bootstrap date: 2026-05-02" in text
+    assert "## One-liner\n\nCurrent utility." in text
+
+
+def test_bootstrap_intent_from_scope_creates_intent_when_missing(tmp_path):
+    repo = tmp_path / "repo"
+    repo.mkdir()
+    (repo / "SCOPE.md").write_text("# SCOPE\n\nProvides search.\n", encoding="utf-8")
+
+    result = bootstrap_intent_from_scope(repo, today=date(2026, 5, 2))
+
+    assert result.status == "created"
+    intent_text = (repo / "INTENT.md").read_text(encoding="utf-8")
+    assert intent_text.startswith("# INTENT")
+    assert "Provides search." in intent_text
+
+
+def test_bootstrap_intent_from_scope_does_not_overwrite_existing_intent(tmp_path):
+    repo = tmp_path / "repo"
+    repo.mkdir()
+    (repo / "SCOPE.md").write_text("# SCOPE\n", encoding="utf-8")
+    (repo / "INTENT.md").write_text("# INTENT\n\nKeep me.\n", encoding="utf-8")
+
+    result = bootstrap_intent_from_scope(repo)
+
+    assert result.status == "exists"
+    assert (repo / "INTENT.md").read_text(encoding="utf-8") == "# INTENT\n\nKeep me.\n"
+
+
+def test_bootstrap_intent_from_scope_dry_run_reports_without_writing(tmp_path):
+    repo = tmp_path / "repo"
+    repo.mkdir()
+    (repo / "SCOPE.md").write_text("# SCOPE\n", encoding="utf-8")
+
+    result = bootstrap_intent_from_scope(repo, dry_run=True)
+
+    assert result.status == "would_create"
+    assert not (repo / "INTENT.md").exists()
--- a/tests/test_repository_scanner.py
+++ b/tests/test_repository_scanner.py
@@ -42,20 +42,29 @@ def test_deterministic_scanner_extracts_structural_facts(tmp_path):
    assert languages == {"Python": 2}


-def test_scanner_records_scope_with_source_role(tmp_path):
+def test_scanner_records_intent_and_scope_with_distinct_source_roles(tmp_path):
    repo = tmp_path / "sample"
    repo.mkdir()
+    (repo / "INTENT.md").write_text(
+        "# INTENT\n\nProvides planned OIDC profile enforcement.\n",
+        encoding="utf-8",
+    )
    (repo / "SCOPE.md").write_text(
-        "# SCOPE\n\n## One-liner\n\nProvides OIDC profile enforcement.\n",
+        "# SCOPE\n\n## One-liner\n\nCurrently provides OIDC profile enforcement.\n",
        encoding="utf-8",
    )

    result = DeterministicScanner().scan(repo)

+    intent_fact = next(fact for fact in result.facts if fact.kind == "intent")
+    assert intent_fact.name == "INTENT"
+    assert intent_fact.path == "INTENT.md"
+    assert intent_fact.metadata["source_role"] == "intent_summary"
+
    scope_fact = next(fact for fact in result.facts if fact.kind == "scope")
    assert scope_fact.name == "SCOPE"
    assert scope_fact.path == "SCOPE.md"
-    assert scope_fact.metadata["source_role"] == "scope_summary"
+    assert scope_fact.metadata["source_role"] == "derived_scope"


 def test_scanner_readme_only_fixture_records_docs_without_interfaces(tmp_path):
--- a/workplans/RREG-WP-0009-provenance-aware-characteristic-rebuild.md
+++ b/workplans/RREG-WP-0009-provenance-aware-characteristic-rebuild.md
@@ -23,8 +23,9 @@ dependency, import, or operational convention mentioned in its files.
 The target behavior is facts-first and provenance-aware:

 - Deterministic scanning observes facts without over-interpreting them.
-- Facts carry source roles such as product documentation, scope summary,
-  implementation source, dependency declaration, agent guidance, or CI/tooling.
+- Facts carry source roles such as intent summary, derived scope, product
+  documentation, implementation source, dependency declaration, agent guidance,
+  or CI/tooling.
 - Characteristic generation promotes only repository-owned utility unless the
  repository clearly acts as a facade or adapter for another capability.
 - Rebuild workflows can discard old approved characteristics and regenerate a
@@ -44,7 +45,12 @@ generation can distinguish product evidence from ambient context.

 Initial source roles:

-- `scope_summary`: `SCOPE.md` and other canonical scope files.
+- `intent_summary`: `INTENT.md` and other design-intent files that describe why
+  the repository should exist and what utility it is meant to provide.
+- `derived_scope`: `SCOPE.md` and other generated or curated current-scope
+  files. These are valuable context, but should not be treated as primary
+  evidence for regenerating characteristics unless a curator explicitly chooses
+  a bootstrap/import mode.
 - `product_documentation`: README, docs, specifications, user-facing guides.
 - `implementation_source`: code files owned by the repository.
 - `test_evidence`: test and acceptance files.
@@ -59,8 +65,10 @@ Initial source roles:
 Acceptance criteria:
 - Observed facts can carry a source role in metadata without breaking existing
  storage or API consumers.
-- `SCOPE.md` is indexed as `scope_summary` and gets high priority during
+- `INTENT.md` is indexed as `intent_summary` and gets high priority during
  candidate generation.
+- `SCOPE.md` is indexed as `derived_scope` and remains distinguishable from
+  source evidence and design intent.
 - Agent guidance files are classified separately from product documentation.
 - Content chunks preserve the fact source role used to produce them.

@@ -113,19 +121,24 @@ Acceptance criteria:

 ```task
 id: RREG-WP-0009-T04
-status: todo
+status: in_progress
 priority: high
 state_hub_task_id: "4f666cd6-471e-4af9-b53c-4f3d7a1d1973"
 ```

-Use canonical scope files and product documentation as stronger evidence for
+Use explicit intent files and product documentation as stronger evidence for
 expected repository utility than ambient config, CI files, dependency mentions,
-or agent instructions.
+agent instructions, or previously derived scope files.

 Acceptance criteria:
-- Candidate ability naming prefers `SCOPE.md` one-liner/core idea when present.
-- Candidate capability generation can extract explicit `Provided Capabilities`
-  blocks from `SCOPE.md`.
+- Candidate ability naming prefers `INTENT.md` one-liner/core idea when present.
+- Candidate capability generation can extract explicit intended capability
+  blocks from `INTENT.md`.
+- `SCOPE.md` is treated as derived current scope, not as ordinary evidence for
+  rebuilding the characteristic model from scratch.
+- Existing `SCOPE.md` files can be explicitly bootstrapped into initial
+  `INTENT.md` files when no intent file exists; this is a one-time migration
+  aid, not an ongoing equivalence between scope and intent.
 - README/docs/spec evidence is weighted above CI/tooling and generic config.
 - key-cape generates candidates centered on lightweight IAM, OIDC/PKCE profile
  enforcement, migration tooling, and LDAP/schema validation rather than LLM
@@ -226,7 +239,7 @@ Acceptance criteria:

 ```task
 id: RREG-WP-0009-T09
-status: todo
+status: in_progress
 priority: medium
 state_hub_task_id: "071f6d76-c92b-4ac1-825c-edcbef4bdbf6"
 ```