---
id: RREG-WP-0013
type: workplan
title: "Self-Scoping Baseline Evaluation"
domain: capabilities
repo: repo-scoping
status: done
owner: codex
topic_slug: foerster-capabilities
created: "2026-05-15"
updated: "2026-05-15"
state_hub_workstream_id: "1c740db0-1999-478b-b3e3-c0fdfec1e9dd"
---

# Self-Scoping Baseline Evaluation

repo-scoping should become a self-improving infrastructure: every meaningful change to the scoping engine should be testable against a known baseline for repo-scoping itself. The goal is not just to assert that output changed, but to make it easy for a human or trusted agent to decide whether an old or new result is better and preserve that assessment as signal for future engine iterations.

The motivating failure is the 2026-05-15 self-analysis where deterministic provider-vocabulary facts were promoted into an approved `Route LLM Requests Across Providers` capability and the repo's native API/CLI features were attached under that incorrect capability. Future reruns should make regressions like that obvious, reviewable, and attributable to the exact repo-scoping release that generated them.

## T01: Define Self-Scoping Assessment Model

```task
id: RREG-WP-0013-T01
status: done
priority: high
state_hub_task_id: "af633b76-3356-4480-8108-d996eeda5a31"
```

Define the data model for immutable self-scoping assessment runs.

Each assessment must bind together:

- The target repository identity: repo slug, source URL/path, target commit, target branch, and dirty-state marker when applicable.
- The engine identity: repo-scoping package version, git commit, git tag or release name when available, dirty-state marker, scanner version, candidate generator version, quality-gate/ruleset version, schema version, and prompt version/hash when LLM or agentic review is used.
- The execution mode: deterministic-only, LLM-assisted, agent-reviewed, trusted-auto-review, manual-review, or mixed.
- The generated artifacts: observed fact summary, candidate graph, approved map or proposed approval set, rejected/downgraded items, source refs, and review notes.
- The assessment outcome: baseline, challenger, preferred, tied, rejected, superseded, or needs-human.

Acceptance criteria:

- A documented schema exists for self-scoping assessment runs.
- Assessment runs are append-only; reruns create new records instead of rewriting old judgements.
- Engine release binding is required before an assessment can be compared.
- Dirty working trees are visible in the assessment metadata.

## T02: Capture Current Bad Self-Run As A Regression Seed

```task
id: RREG-WP-0013-T02
status: done
priority: high
state_hub_task_id: "98258aea-65bb-4709-921f-711c6cc6ee48"
```

Import or recreate the known-bad repo-scoping self-analysis as a named regression seed.

Known bad pattern:

- Candidate/approved capability: `Route LLM Requests Across Providers`.
- Incorrect feature attachment: repo-scoping API/CLI surfaces nested under that LLM-provider capability.
- Incorrect evidence: scanner vocabulary, schema examples, tests, and provider-name normalization code treated as repo-owned LLM routing behavior.

Acceptance criteria:

- The bad run can be inspected as a historical assessment artifact.
- It is clearly marked as a negative baseline, not a desired golden output.
- The failure explanation is stored next to the captured graph.
- Future comparison reports can flag when a challenger repeats the same pattern.

## T03: Create Desired Repo-Scoping Golden Profile

```task
id: RREG-WP-0013-T03
status: done
priority: high
state_hub_task_id: "f3ef1711-a115-4368-a97e-98abd1eda521"
```

Author a curated golden profile for repo-scoping itself. This should be compact enough for comparison but expressive enough to catch hierarchy errors.

Expected native capabilities should cover at least:

- Repository registration and metadata import.
- Deterministic repository scanning into observed facts.
- Source-role and provenance-aware content indexing.
- Candidate characteristic generation from facts and content.
- Candidate review, edit, reject, merge, relink, and approval workflow.
- Approved characteristic search, comparison, export, and capability-gap exploration.
- SCOPE.md generation, diffing, validation, and write/update flows.
- Dependency graph and characteristic impact exploration.
- Scope context API support for downstream agents such as activity-core.

Forbidden top-level/native capabilities should include:

- `Route LLM Requests Across Providers`, unless repo-scoping later genuinely implements provider routing as a product feature rather than using `llm-connect` as optional extraction infrastructure.

Acceptance criteria:

- The golden profile includes ability, capability, feature, and evidence expectations with source paths.
- The profile distinguishes native utility from dependencies, fixtures, test vocabulary, schema examples, and optional LLM extraction infrastructure.
- The profile is stored in a stable, reviewable fixture location.
- The profile can evolve through explicit assessment decisions.

Implementation note 2026-05-15: added `docs/schemas/self-scoping-assessment.schema.json`, `docs/self-scoping/assessments/repo-scoping-known-bad-2026-05-15-run-39.json`, `docs/self-scoping/golden/repo-scoping-golden-profile.v1.json`, and `tests/test_self_scoping_artifacts.py`. The known-bad artifact is marked as a negative regression seed with `historical_incomplete` release binding because the original analysis run did not record the engine commit.

## T04: Export Assessment Artifacts From Analysis Runs

```task
id: RREG-WP-0013-T04
status: done
priority: high
state_hub_task_id: "51e01d45-7574-4c97-994d-dabb2bcf9a00"
```

Add a CLI and/or API workflow that exports a completed analysis run as a self-scoping assessment artifact.
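
As a rough illustration of such an export step, the sketch below serializes a run to byte-stable JSON and declines to mark it comparable when engine identity is incomplete. It is not the actual exporter: the `export_assessment` helper, the `REQUIRED_ENGINE_FIELDS` tuple, the artifact keys, and the `0.0.0` placeholder version are all hypothetical names chosen for this example.

```python
import json

# Hypothetical minimum set of engine-identity fields; the real schema may differ.
REQUIRED_ENGINE_FIELDS = ("package_version", "git_commit", "schema_version")


def export_assessment(run: dict, engine: dict) -> str:
    """Serialize an analysis run as a deterministic JSON assessment artifact."""
    # Collect required identity fields that are absent or empty.
    missing = [name for name in REQUIRED_ENGINE_FIELDS if not engine.get(name)]
    artifact = {
        "engine": engine,
        "run": run,
        # Comparable only when every required identity field is present.
        "comparable": not missing,
        "missing_engine_fields": missing,
    }
    # sort_keys plus fixed separators keep the output stable across runs.
    return json.dumps(artifact, sort_keys=True, indent=2, separators=(",", ": ")) + "\n"


if __name__ == "__main__":
    run = {"repo": "repo-scoping", "mode": "deterministic-only"}
    # Placeholder engine identity with no commit or schema version recorded.
    print(export_assessment(run, {"package_version": "0.0.0"}))
```

Sorting keys means two exports of the same run produce identical files, so an ordinary text diff is enough to spot real changes between artifacts.
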

Acceptance criteria:

- Export includes repository metadata, analysis run metadata, engine identity, candidate graph, observed fact summary, content chunk summary, approved map if present, review decisions, and quality-gate outcomes when available.
- Export format is deterministic JSON with a documented schema.
- Export refuses to mark an artifact comparable when engine identity is incomplete.
- Export can target repo-scoping itself without requiring network access.

Implementation note 2026-05-15: added `src/repo_registry/self_scoping/assessment.py` and the `repo-scoping export-assessment` CLI command. The exporter reads an existing completed analysis run, records engine identity, generated candidate tree, approved map, fact/content summaries, review decisions, empty quality-gate outcomes pending RREG-WP-0014, and known regression patterns. Focused tests cover the exporter and CLI path.

## T05: Compare Baseline And Challenger Runs

```task
id: RREG-WP-0013-T05
status: done
priority: high
state_hub_task_id: "2b71069b-6150-45f4-84a2-59f5ec1e04c0"
```

Implement comparison between an existing baseline and a later challenger run.

Comparison should report:

- Added, removed, renamed, and moved abilities/capabilities/features.
- Hierarchy quality changes, especially misplaced features under the wrong capability.
- Native-utility precision: whether generated capabilities are repo-owned, facade/adapter, dependency, tooling, fixture, schema-example, or mention-only.
- Coverage against the repo-scoping golden profile.
- Regression flags for known-bad patterns.
- Source-ref quality: whether claims cite product intent, docs, source, tests, fixtures, examples, or generated/derived scope.

Acceptance criteria:

- Comparison output is useful in both machine-readable JSON and human-readable Markdown.
- The report makes it easy to choose "old better", "new better", "tie", or "needs review".
- It does not require candidates to have stable database IDs across runs.
- It can compare deterministic-only and agent-reviewed runs without losing provenance.

Implementation note 2026-05-15: added `src/repo_registry/self_scoping/comparison.py` and the `repo-scoping compare-assessment` CLI command. The first comparison report checks assessment artifacts against the repo-scoping golden profile, reports missing expected capabilities, forbidden native capability matches, known regression patterns, and misplaced API/CLI features under provider-routing capabilities. Reports can be emitted as JSON or Markdown.

## T06: Add Side-By-Side Review UI

```task
id: RREG-WP-0013-T06
status: done
priority: medium
state_hub_task_id: "16a60b7c-7e2c-4bb0-b4ab-2381289dba0b"
```

Expose baseline/challenger comparison in the curator UI.

Acceptance criteria:

- Reviewers can select two assessment artifacts for repo-scoping.
- The UI shows the two hierarchy trees side by side with moved/misplaced items highlighted.
- Reviewers can record preference, tie, rejection, and notes.
- Review decisions are persisted as assessment outcomes, not as changes to the underlying historical artifacts.

Implementation note 2026-05-15: added a file-backed `/ui/self-scoping` curator surface that reads golden profiles and assessment artifacts from `docs/self-scoping`, renders side-by-side hierarchy comparisons with regression highlights, compares two assessment runs directly for old-vs-new judgement, and records append-only review outcome JSON under `docs/self-scoping/outcomes/`.

## T07: Add Self-Scoping Regression Command

```task
id: RREG-WP-0013-T07
status: done
priority: medium
state_hub_task_id: "af1fcecd-686d-4592-b739-4698abc98c55"
```

Add a repeatable command for running repo-scoping against itself and comparing the result to the active baseline.

Acceptance criteria:

- The command captures engine identity before running analysis.
- The command can run deterministic-only without LLM or agentic review.
- The command can optionally invoke agentic review when configured.
- The command emits a comparison report and exits non-zero only for explicit CI-blocking regressions, not for ordinary "needs review" assessment outcomes.

Implementation note 2026-05-15: added `repo-scoping self-assess`. The command analyzes a source tree, exports a challenger assessment artifact, compares it to the golden profile, emits JSON or Markdown, and returns non-zero only with `--fail-on-regression` when the comparison status is `regression`. The command defaults to deterministic-only; `--with-llm` opts into configured LLM assistance. `--agentic-review` now records an agentic-review request and leaves candidates pending when no agentic reviewer is configured.

## T08: Document Assessment Workflow

```task
id: RREG-WP-0013-T08
status: done
priority: medium
state_hub_task_id: "30d71946-3598-4dc7-9970-c7c18126cad7"
```

Document how maintainers should use self-scoping assessment artifacts while evolving the engine.

Acceptance criteria:

- Documentation explains baseline, challenger, preferred, tied, rejected, and superseded outcomes.
- Documentation explains engine release binding and why unbound output is not comparable.
- Documentation gives examples for the known-bad LLM-provider regression and a desired native repo-scoping profile.
- Documentation describes when to update the golden profile versus when to fix the engine.

Implementation note 2026-05-15: added `docs/self-scoping/workflow.md`. The workflow documents assessment outcomes, release binding, the standard self-assessment loop, CI use, when to update the golden profile, when to fix the engine, and the relationship to RREG-WP-0014 agentic acceptance.

## Completion Criteria

- repo-scoping has an immutable, release-bound self-scoping assessment format.
- The current known-bad output is captured as a negative regression seed.
- A curated desired repo-scoping profile exists.
- Maintainers can rerun repo-scoping on itself, compare old/new results, and record which output is better.
- Comparison results are bound to the repo-scoping release that generated them.
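
To make the golden-profile check behind these criteria more concrete, here is a minimal sketch of comparing a challenger's capability names against expected and forbidden lists. It is an illustration under stated assumptions, not the logic in `src/repo_registry/self_scoping/comparison.py`: the `compare_to_golden` function, the `expected`/`forbidden` keys, the short capability names, and the `needs-review` and `ok` status strings are hypothetical. Only the `regression` status and the forbidden capability name come from this workplan.

```python
def compare_to_golden(capabilities: list[str], golden: dict) -> dict:
    """Report missing expected and matched forbidden capability names."""
    # Normalize names so casing and stray whitespace do not hide a match.
    found = {name.strip().lower() for name in capabilities}
    missing = [n for n in golden["expected"] if n.lower() not in found]
    forbidden = [n for n in golden["forbidden"] if n.lower() in found]
    if forbidden:
        status = "regression"    # a known-bad capability reappeared
    elif missing:
        status = "needs-review"  # hypothetical status: coverage gap only
    else:
        status = "ok"            # hypothetical status: full coverage
    return {"status": status, "missing": missing, "forbidden": forbidden}


if __name__ == "__main__":
    # Hypothetical, abbreviated golden profile for illustration only.
    golden = {
        "expected": ["Repository Registration", "Repository Scanning"],
        "forbidden": ["Route LLM Requests Across Providers"],
    }
    challenger = ["Repository Registration", "Route LLM Requests Across Providers"]
    print(compare_to_golden(challenger, golden))
```

Matching on normalized names rather than database IDs keeps the check usable across runs whose candidates were generated independently.
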