repo-scoping/workplans/RREG-WP-0013-self-scoping-baseline-evaluation.md

---
id: RREG-WP-0013
type: workplan
title: "Self-Scoping Baseline Evaluation"
domain: capabilities
repo: repo-scoping
status: done
owner: codex
topic_slug: foerster-capabilities
created: "2026-05-15"
updated: "2026-05-15"
state_hub_workstream_id: "1c740db0-1999-478b-b3e3-c0fdfec1e9dd"
---

# Self-Scoping Baseline Evaluation

repo-scoping should become self-improving infrastructure: every meaningful
change to the scoping engine should be testable against a known baseline for
repo-scoping itself. The goal is not just to assert that output changed, but to
make it easy for a human or trusted agent to decide whether the old or the new
result is better, and to preserve that assessment as signal for future engine
iterations.

The motivating failure is the 2026-05-15 self-analysis, in which deterministic
provider-vocabulary facts were promoted into an approved `Route LLM Requests
Across Providers` capability and the repo's native API/CLI features were
attached under that incorrect capability. Future reruns should make regressions
like that obvious, reviewable, and attributable to the exact repo-scoping
release that generated them.

## T01: Define Self-Scoping Assessment Model

```task
id: RREG-WP-0013-T01
status: done
priority: high
state_hub_task_id: "af633b76-3356-4480-8108-d996eeda5a31"
```

Define the data model for immutable self-scoping assessment runs.

Each assessment must bind together:

- The target repository identity: repo slug, source URL/path, target commit,
  target branch, and dirty-state marker when applicable.
- The engine identity: repo-scoping package version, git commit, git tag or
  release name when available, dirty-state marker, scanner version, candidate
  generator version, quality-gate/ruleset version, schema version, and prompt
  version/hash when LLM or agentic review is used.
- The execution mode: deterministic-only, LLM-assisted, agent-reviewed,
  trusted-auto-review, manual-review, or mixed.
- The generated artifacts: observed fact summary, candidate graph, approved map
  or proposed approval set, rejected/downgraded items, source refs, and review
  notes.
- The assessment outcome: baseline, challenger, preferred, tied, rejected,
  superseded, or needs-human.

Acceptance criteria:
- A documented schema exists for self-scoping assessment runs.
- Assessment runs are append-only; reruns create new records instead of
  rewriting old judgements.
- Engine release binding is required before an assessment can be compared.
- Dirty working trees are visible in the assessment metadata.
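
The binding rules above could be modeled roughly as follows. This is a minimal
sketch, not the schema that ships with the repo: every class and field name
here (`EngineIdentity`, `AssessmentRun`, `is_bound`, `comparable`) is a
hypothetical illustration, and only the execution-mode and outcome values are
taken from the lists above (lowercased for use as identifiers).

```python
from dataclasses import dataclass
from typing import Optional

# Value sets mirror the lists in this task; spelling as identifiers is assumed.
EXECUTION_MODES = {
    "deterministic-only", "llm-assisted", "agent-reviewed",
    "trusted-auto-review", "manual-review", "mixed",
}
OUTCOMES = {
    "baseline", "challenger", "preferred", "tied",
    "rejected", "superseded", "needs-human",
}


@dataclass(frozen=True)
class EngineIdentity:
    """Hypothetical subset of the engine identity fields."""
    package_version: Optional[str] = None
    git_commit: Optional[str] = None
    ruleset_version: Optional[str] = None
    schema_version: Optional[str] = None
    dirty: bool = False  # a dirty engine tree stays visible in the metadata

    def is_bound(self) -> bool:
        """True when every required release field is present."""
        required = (self.package_version, self.git_commit,
                    self.ruleset_version, self.schema_version)
        return all(required)


@dataclass(frozen=True)
class AssessmentRun:
    """Immutable record; a rerun creates a new instance with a new run_id."""
    run_id: str
    repo_slug: str
    target_commit: str
    target_dirty: bool
    engine: EngineIdentity
    execution_mode: str
    outcome: str = "needs-human"

    def __post_init__(self) -> None:
        if self.execution_mode not in EXECUTION_MODES:
            raise ValueError(f"unknown execution mode: {self.execution_mode}")
        if self.outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {self.outcome}")

    @property
    def comparable(self) -> bool:
        # Release binding is required before a run may be compared.
        return self.engine.is_bound()
```

Freezing the dataclasses is one way to express the append-only rule at the
object level: a changed judgement has to be recorded as a new run rather than
as a mutation of an old one.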

## T02: Capture Current Bad Self-Run As A Regression Seed

```task
id: RREG-WP-0013-T02
status: done
priority: high
state_hub_task_id: "98258aea-65bb-4709-921f-711c6cc6ee48"
```

Import or recreate the known-bad repo-scoping self-analysis as a named
regression seed.

Known bad pattern:

- Candidate/approved capability: `Route LLM Requests Across Providers`.
- Incorrect feature attachment: repo-scoping API/CLI surfaces nested under that
  LLM-provider capability.
- Incorrect evidence: scanner vocabulary, schema examples, tests, and
  provider-name normalization code treated as repo-owned LLM routing behavior.

Acceptance criteria:
- The bad run can be inspected as a historical assessment artifact.
- It is clearly marked as a negative baseline, not a desired golden output.
- The failure explanation is stored next to the captured graph.
- Future comparison reports can flag when a challenger repeats the same pattern.
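
A check for a challenger repeating this pattern might look like the sketch
below. It assumes a simplified candidate tree in which each capability is a
dict with a `name` and a list of `features`, and each feature carries a
hypothetical `surface` tag; the captured artifact may use a different shape.

```python
FORBIDDEN_CAPABILITY = "Route LLM Requests Across Providers"
NATIVE_SURFACES = {"api", "cli"}  # hypothetical `surface` tag values


def find_repeated_pattern(candidate_tree: list) -> list:
    """Return native API/CLI feature names nested under the forbidden capability."""
    flagged = []
    for capability in candidate_tree:
        # Compare names case-insensitively so trivial renames do not hide the pattern.
        if capability["name"].strip().lower() != FORBIDDEN_CAPABILITY.lower():
            continue
        for feature in capability.get("features", []):
            if feature.get("surface") in NATIVE_SURFACES:
                flagged.append(feature["name"])
    return flagged


# Illustrative data only; not copied from the captured run.
known_bad_tree = [
    {
        "name": "Route LLM Requests Across Providers",
        "features": [
            {"name": "Export CLI", "surface": "cli"},
            {"name": "Scope context API", "surface": "api"},
        ],
    },
]
```

An empty result means the challenger did not reproduce the misattachment; a
non-empty result lists the features a reviewer should look at first.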

## T03: Create Desired Repo-Scoping Golden Profile

```task
id: RREG-WP-0013-T03
status: done
priority: high
state_hub_task_id: "f3ef1711-a115-4368-a97e-98abd1eda521"
```

Author a curated golden profile for repo-scoping itself. The profile should be
compact enough to compare against, yet expressive enough to catch hierarchy
errors.

Expected native capabilities should cover at least:

- Repository registration and metadata import.
- Deterministic repository scanning into observed facts.
- Source-role and provenance-aware content indexing.
- Candidate characteristic generation from facts and content.
- Candidate review, edit, reject, merge, relink, and approval workflow.
- Approved characteristic search, comparison, export, and capability-gap
  exploration.
- SCOPE.md generation, diffing, validation, and write/update flows.
- Dependency graph and characteristic impact exploration.
- Scope context API support for downstream agents such as activity-core.

Forbidden top-level/native capabilities should include:

- `Route LLM Requests Across Providers`, unless repo-scoping later genuinely
  implements provider routing as a product feature rather than using
  `llm-connect` as optional extraction infrastructure.

Acceptance criteria:
- The golden profile includes ability, capability, feature, and evidence
  expectations with source paths.
- The profile distinguishes native utility from dependencies, fixtures, test
  vocabulary, schema examples, and optional LLM extraction infrastructure.
- The profile is stored in a stable, reviewable fixture location.
- The profile can evolve through explicit assessment decisions.

Implementation note 2026-05-15: added
`docs/schemas/self-scoping-assessment.schema.json`,
`docs/self-scoping/assessments/repo-scoping-known-bad-2026-05-15-run-39.json`,
`docs/self-scoping/golden/repo-scoping-golden-profile.v1.json`, and
`tests/test_self_scoping_artifacts.py`. The known-bad artifact is marked as a
negative regression seed with `historical_incomplete` release binding because
the original analysis run did not record the engine commit.
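
To illustrate how expected and forbidden capabilities can drive a coverage
check, here is a small sketch. The profile keys (`expected_capabilities`,
`forbidden_capabilities`) and the helper names are assumptions made for this
example and are not taken from `repo-scoping-golden-profile.v1.json`; the
capability names are a subset of the lists above.

```python
def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so cosmetic differences do not matter."""
    return " ".join(name.lower().split())


# Hypothetical, abbreviated profile shape for illustration.
GOLDEN_PROFILE = {
    "expected_capabilities": [
        "Repository registration and metadata import",
        "Deterministic repository scanning into observed facts",
        "SCOPE.md generation, diffing, validation, and write/update flows",
    ],
    "forbidden_capabilities": ["Route LLM Requests Across Providers"],
}


def check_against_golden(capability_names: list, profile: dict) -> dict:
    """Report expected capabilities that are missing and forbidden ones that appear."""
    seen = {normalize(name) for name in capability_names}
    missing = [name for name in profile["expected_capabilities"]
               if normalize(name) not in seen]
    forbidden = [name for name in profile["forbidden_capabilities"]
                 if normalize(name) in seen]
    return {"missing_expected": missing, "forbidden_matches": forbidden}
```

Exact-name matching is the simplest possible rule and is likely too strict for
real runs; a production check would probably need aliases or evidence-based
matching.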

## T04: Export Assessment Artifacts From Analysis Runs

```task
id: RREG-WP-0013-T04
status: done
priority: high
state_hub_task_id: "51e01d45-7574-4c97-994d-dabb2bcf9a00"
```

Add a CLI and/or API workflow that exports a completed analysis run as a
self-scoping assessment artifact.

Acceptance criteria:
- Export includes repository metadata, analysis run metadata, engine identity,
  candidate graph, observed fact summary, content chunk summary, approved map
  if present, review decisions, and quality-gate outcomes when available.
- Export format is deterministic JSON with a documented schema.
- Export refuses to mark an artifact comparable when engine identity is
  incomplete.
- Export can target repo-scoping itself without requiring network access.

Implementation note 2026-05-15: added
`src/repo_registry/self_scoping/assessment.py` and the
`repo-scoping export-assessment` CLI command. The exporter reads an existing
completed analysis run and records engine identity, the generated candidate
tree, the approved map, fact/content summaries, review decisions, empty
quality-gate outcomes (pending RREG-WP-0014), and known regression patterns.
Focused tests cover the exporter and CLI path.
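
Two of the acceptance criteria, deterministic JSON and refusing to mark
incompletely bound artifacts as comparable, can be sketched with the standard
library alone. The function name, the `engine_identity` key, and the list of
required fields below are assumptions for illustration and do not describe the
code in `assessment.py`.

```python
import json

# Hypothetical minimum set of engine fields needed for release binding.
REQUIRED_ENGINE_FIELDS = ("package_version", "git_commit", "schema_version")


def export_assessment(run: dict) -> str:
    """Serialize a run so identical content always yields identical text."""
    engine = run.get("engine_identity", {})
    missing = sorted(f for f in REQUIRED_ENGINE_FIELDS if not engine.get(f))
    artifact = {
        **run,
        # Never claim comparability when engine identity is incomplete.
        "comparable": not missing,
        "missing_engine_fields": missing,
    }
    # sort_keys and fixed indentation remove dependence on dict insertion order.
    return json.dumps(artifact, sort_keys=True, indent=2, ensure_ascii=False) + "\n"
```

Stable key ordering keeps artifact diffs small and reviewable, which matters
when the exported files are committed next to the golden profile.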

## T05: Compare Baseline And Challenger Runs

```task
id: RREG-WP-0013-T05
status: done
priority: high
state_hub_task_id: "2b71069b-6150-45f4-84a2-59f5ec1e04c0"
```

Implement comparison between an existing baseline and a later challenger run.

Comparison should report:

- Added, removed, renamed, and moved abilities/capabilities/features.
- Hierarchy quality changes, especially misplaced features under the wrong
  capability.
- Native-utility precision: whether generated capabilities are repo-owned,
  facade/adapter, dependency, tooling, fixture, schema-example, or mention-only.
- Coverage against the repo-scoping golden profile.
- Regression flags for known-bad patterns.
- Source-ref quality: whether claims cite product intent, docs, source, tests,
  fixtures, examples, or generated/derived scope.

Acceptance criteria:
- Comparison output is useful in both machine-readable JSON and human-readable
  Markdown.
- The report makes it easy to choose "old better", "new better", "tie", or
  "needs review".
- It does not require candidates to have stable database IDs across runs.
- It can compare deterministic-only and agent-reviewed runs without losing
  provenance.

Implementation note 2026-05-15: added
`src/repo_registry/self_scoping/comparison.py` and the
`repo-scoping compare-assessment` CLI command. The first comparison report
checks assessment artifacts against the repo-scoping golden profile and reports
missing expected capabilities, forbidden native capability matches, known
regression patterns, and misplaced API/CLI features under provider-routing
capabilities. Reports can be emitted as JSON or Markdown.
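
One way to compare runs without stable database IDs is to key every node by
its normalized name and compare parent paths. The sketch below assumes a
simplified tree of dicts with `name` and `children` keys and assumes names are
unique within a run; it is an illustration of the idea, not the logic in
`comparison.py`, and it does not attempt rename detection.

```python
def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so cosmetic edits do not count as changes."""
    return " ".join(name.lower().split())


def index_by_name(nodes: list, parent: tuple = ()) -> dict:
    """Map each node's normalized name to the normalized path of its ancestors."""
    index = {}
    for node in nodes:
        key = normalize(node["name"])
        index[key] = parent
        index.update(index_by_name(node.get("children", []), parent + (key,)))
    return index


def compare_trees(baseline: list, challenger: list) -> dict:
    """Report nodes that were added, removed, or moved to a different parent."""
    old, new = index_by_name(baseline), index_by_name(challenger)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "moved": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }
```

A feature that keeps its name but changes parent shows up under `moved`, which
is the signal needed to spot features misplaced under the wrong capability.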

## T06: Add Side-By-Side Review UI

```task
id: RREG-WP-0013-T06
status: done
priority: medium
state_hub_task_id: "16a60b7c-7e2c-4bb0-b4ab-2381289dba0b"
```

Expose baseline/challenger comparison in the curator UI.

Acceptance criteria:
- Reviewers can select two assessment artifacts for repo-scoping.
- The UI shows the two hierarchy trees side by side with moved/misplaced items
  highlighted.
- Reviewers can record preference, tie, rejection, and notes.
- Review decisions are persisted as assessment outcomes, not as changes to the
  underlying historical artifacts.

Implementation note 2026-05-15: added a file-backed `/ui/self-scoping` curator
surface that reads golden profiles and assessment artifacts from
`docs/self-scoping`, renders side-by-side hierarchy comparisons with regression
highlights, compares two assessment runs directly for old-vs-new judgement, and
records append-only review outcome JSON under `docs/self-scoping/outcomes/`.
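
Append-only persistence of review outcomes can be approximated with exclusive
file creation. The sketch below is an assumption about how such a writer could
work; the function name, the one-file-per-outcome layout, and the record
fields are hypothetical and not taken from the curator UI code.

```python
import json
import tempfile
from pathlib import Path


def record_outcome(outcomes_dir: Path, outcome_id: str, record: dict) -> Path:
    """Write one outcome file; refuse to overwrite an existing judgement."""
    outcomes_dir.mkdir(parents=True, exist_ok=True)
    path = outcomes_dir / f"{outcome_id}.json"
    # Mode "x" raises FileExistsError if the file is already present.
    with path.open("x", encoding="utf-8") as handle:
        json.dump(record, handle, sort_keys=True, indent=2)
        handle.write("\n")
    return path


if __name__ == "__main__":
    # Demo in a throwaway directory; field names are illustrative.
    with tempfile.TemporaryDirectory() as tmp:
        written = record_outcome(
            Path(tmp) / "outcomes",
            "outcome-0001",
            {"baseline": "run-39", "challenger": "run-40", "preference": "new better"},
        )
        print(written.name)
```

A changed judgement is then stored under a new outcome ID, leaving both the
earlier outcome and the underlying assessment artifacts untouched.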

## T07: Add Self-Scoping Regression Command

```task
id: RREG-WP-0013-T07
status: done
priority: medium
state_hub_task_id: "af1fcecd-686d-4592-b739-4698abc98c55"
```

Add a repeatable command for running repo-scoping against itself and comparing
the result to the active baseline.

Acceptance criteria:
- The command captures engine identity before running analysis.
- The command can run deterministic-only without LLM or agentic review.
- The command can optionally invoke agentic review when configured.
- The command emits a comparison report and exits non-zero only for explicit
  CI-blocking regressions, not for ordinary "needs review" assessment outcomes.

Implementation note 2026-05-15: added `repo-scoping self-assess`. The command
analyzes a source tree, exports a challenger assessment artifact, compares it to
the golden profile, emits JSON or Markdown, and returns non-zero only with
`--fail-on-regression` when the comparison status is `regression`. The command
defaults to deterministic-only; `--with-llm` opts into configured LLM assistance.
`--agentic-review` now records an agentic-review request and leaves candidates
pending when no agentic reviewer is configured.
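
The exit-code policy described above reduces to a small decision function.
This is a sketch of the rule only: the function name is hypothetical, and the
only status value confirmed by this note is `regression`; the other strings in
the usage lines are placeholders.

```python
def exit_code(comparison_status: str, fail_on_regression: bool) -> int:
    """Return non-zero only for an explicit, opted-in regression."""
    if fail_on_regression and comparison_status == "regression":
        return 1
    # Every other status, including ones that need human review, passes.
    return 0


# Usage, with placeholder statuses other than "regression":
ci_result = exit_code("regression", fail_on_regression=True)
local_result = exit_code("regression", fail_on_regression=False)
review_result = exit_code("needs_review", fail_on_regression=True)
```

Keeping the blocking condition this narrow lets CI gate on known regressions
while leaving judgement calls to reviewers.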

## T08: Document Assessment Workflow

```task
id: RREG-WP-0013-T08
status: done
priority: medium
state_hub_task_id: "30d71946-3598-4dc7-9970-c7c18126cad7"
```

Document how maintainers should use self-scoping assessment artifacts while
evolving the engine.

Acceptance criteria:
- Documentation explains baseline, challenger, preferred, tied, rejected, and
  superseded outcomes.
- Documentation explains engine release binding and why unbound output is not
  comparable.
- Documentation gives examples for the known-bad LLM-provider regression and a
  desired native repo-scoping profile.
- Documentation describes when to update the golden profile versus when to fix
  the engine.

Implementation note 2026-05-15: added `docs/self-scoping/workflow.md`. The
workflow documents assessment outcomes, release binding, the standard
self-assessment loop, CI use, when to update the golden profile, when to fix the
engine, and the relationship to RREG-WP-0014 agentic acceptance.

## Completion Criteria

- repo-scoping has an immutable, release-bound self-scoping assessment format.
- The current known-bad output is captured as a negative regression seed.
- A curated desired repo-scoping profile exists.
- Maintainers can rerun repo-scoping on itself, compare old/new results, and
  record which output is better.
- Comparison results are bound to the repo-scoping release that generated them.