---
id: RREG-WP-0013
type: workplan
title: "Self-Scoping Baseline Evaluation"
domain: capabilities
repo: repo-scoping
status: done
owner: codex
topic_slug: foerster-capabilities
created: "2026-05-15"
updated: "2026-05-15"
state_hub_workstream_id: "1c740db0-1999-478b-b3e3-c0fdfec1e9dd"
---
# Self-Scoping Baseline Evaluation
repo-scoping should become self-improving infrastructure: every meaningful
change to the scoping engine should be testable against a known baseline for
repo-scoping itself. The goal is not just to assert that the output changed, but
to make it easy for a human or trusted agent to decide whether the old or the
new result is better, and to preserve that assessment as a signal for future
engine iterations.

The motivating failure is the 2026-05-15 self-analysis, in which deterministic
provider-vocabulary facts were promoted into an approved `Route LLM Requests
Across Providers` capability and the repo's native API/CLI features were
attached under that incorrect capability. Future reruns should make regressions
like that obvious, reviewable, and attributable to the exact repo-scoping
release that generated them.
## T01: Define Self-Scoping Assessment Model
```task
id: RREG-WP-0013-T01
status: done
priority: high
state_hub_task_id: "af633b76-3356-4480-8108-d996eeda5a31"
```
Define the data model for immutable self-scoping assessment runs.
Each assessment must bind together:
- The target repository identity: repo slug, source URL/path, target commit,
target branch, and dirty-state marker when applicable.
- The engine identity: repo-scoping package version, git commit, git tag or
release name when available, dirty-state marker, scanner version, candidate
generator version, quality-gate/ruleset version, schema version, and prompt
version/hash when LLM or agentic review is used.
- The execution mode: deterministic-only, LLM-assisted, agent-reviewed,
trusted-auto-review, manual-review, or mixed.
- The generated artifacts: observed fact summary, candidate graph, approved map
or proposed approval set, rejected/downgraded items, source refs, and review
notes.
- The assessment outcome: baseline, challenger, preferred, tied, rejected,
superseded, or needs-human.

Acceptance criteria:
- A documented schema exists for self-scoping assessment runs.
- Assessment runs are append-only; reruns create new records instead of
rewriting old judgements.
- Engine release binding is required before an assessment can be compared.
- Dirty working trees are visible in the assessment metadata.
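
The binding described above can be sketched as a set of frozen dataclasses. This is a minimal illustration, not the documented schema: the class names, the field names, and the choice of `package_version` plus `git_commit` as the minimum for release binding are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ExecutionMode(str, Enum):
    DETERMINISTIC_ONLY = "deterministic-only"
    LLM_ASSISTED = "llm-assisted"
    AGENT_REVIEWED = "agent-reviewed"
    TRUSTED_AUTO_REVIEW = "trusted-auto-review"
    MANUAL_REVIEW = "manual-review"
    MIXED = "mixed"


class Outcome(str, Enum):
    BASELINE = "baseline"
    CHALLENGER = "challenger"
    PREFERRED = "preferred"
    TIED = "tied"
    REJECTED = "rejected"
    SUPERSEDED = "superseded"
    NEEDS_HUMAN = "needs-human"


@dataclass(frozen=True)
class EngineIdentity:
    # Hypothetical subset of the engine fields listed in this task.
    package_version: Optional[str]
    git_commit: Optional[str]
    dirty: bool
    schema_version: Optional[str] = None

    def is_bound(self) -> bool:
        # Assumption: version and commit are the minimum for release binding.
        return bool(self.package_version and self.git_commit)


@dataclass(frozen=True)
class TargetIdentity:
    repo_slug: str
    commit: str
    branch: str
    dirty: bool


@dataclass(frozen=True)
class AssessmentRun:
    run_id: str
    target: TargetIdentity
    engine: EngineIdentity
    mode: ExecutionMode
    outcome: Outcome

    @property
    def comparable(self) -> bool:
        # A dirty tree stays visible in the metadata but does not block
        # comparison; only a missing engine binding does.
        return self.engine.is_bound()
```

Freezing the dataclasses mirrors the append-only rule at the object level: a rerun has to construct a new `AssessmentRun` rather than mutate an existing one.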
## T02: Capture Current Bad Self-Run As A Regression Seed
```task
id: RREG-WP-0013-T02
status: done
priority: high
state_hub_task_id: "98258aea-65bb-4709-921f-711c6cc6ee48"
```
Import or recreate the known-bad repo-scoping self-analysis as a named
regression seed.
Known bad pattern:
- Candidate/approved capability: `Route LLM Requests Across Providers`.
- Incorrect feature attachment: repo-scoping API/CLI surfaces nested under that
LLM-provider capability.
- Incorrect evidence: scanner vocabulary, schema examples, tests, and
provider-name normalization code treated as repo-owned LLM routing behavior.

Acceptance criteria:
- The bad run can be inspected as a historical assessment artifact.
- It is clearly marked as a negative baseline, not a desired golden output.
- The failure explanation is stored next to the captured graph.
- Future comparison reports can flag when a challenger repeats the same pattern.
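
A check for the known-bad pattern might look like the sketch below. It assumes a candidate tree of nested dicts with hypothetical `name`, `kind`, `surface`, and `children` keys; the real artifact layout may differ, and the pattern labels are invented for illustration.

```python
FORBIDDEN_CAPABILITY = "Route LLM Requests Across Providers"
SURFACE_KINDS = {"api", "cli"}  # hypothetical tag on feature nodes


def _normalize(name: str) -> str:
    """Lower-case and collapse whitespace so cosmetic changes do not hide a match."""
    return " ".join(name.lower().split())


def find_known_bad(nodes: list) -> list:
    """Return one flag per occurrence of the known-bad pattern in a candidate tree."""
    flags = []
    for node in nodes:
        children = node.get("children", [])
        is_forbidden = (
            node.get("kind") == "capability"
            and _normalize(node.get("name", "")) == _normalize(FORBIDDEN_CAPABILITY)
        )
        if is_forbidden:
            flags.append({"pattern": "forbidden-capability", "name": node["name"]})
            # API/CLI features nested directly under the forbidden capability
            # are the second half of the known-bad pattern.
            for child in children:
                if child.get("kind") == "feature" and child.get("surface") in SURFACE_KINDS:
                    flags.append({"pattern": "misplaced-surface-feature", "name": child["name"]})
        flags.extend(find_known_bad(children))
    return flags
```

A comparison report could then list these flags next to the stored failure explanation, so a challenger that repeats the pattern is immediately recognizable.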
## T03: Create Desired Repo-Scoping Golden Profile
```task
id: RREG-WP-0013-T03
status: done
priority: high
state_hub_task_id: "f3ef1711-a115-4368-a97e-98abd1eda521"
```
Author a curated golden profile for repo-scoping itself. This should be compact
enough for comparison but expressive enough to catch hierarchy errors.
Expected native capabilities should cover at least:
- Repository registration and metadata import.
- Deterministic repository scanning into observed facts.
- Source-role and provenance-aware content indexing.
- Candidate characteristic generation from facts and content.
- Candidate review, edit, reject, merge, relink, and approval workflow.
- Approved characteristic search, comparison, export, and capability-gap
exploration.
- SCOPE.md generation, diffing, validation, and write/update flows.
- Dependency graph and characteristic impact exploration.
- Scope context API support for downstream agents such as activity-core.

Forbidden top-level/native capabilities should include:
- `Route LLM Requests Across Providers`, unless repo-scoping later genuinely
implements provider routing as a product feature rather than using
`llm-connect` as optional extraction infrastructure.

Acceptance criteria:
- The golden profile includes ability, capability, feature, and evidence
expectations with source paths.
- The profile distinguishes native utility from dependencies, fixtures, test
vocabulary, schema examples, and optional LLM extraction infrastructure.
- The profile is stored in a stable, reviewable fixture location.
- The profile can evolve through explicit assessment decisions.

Implementation note 2026-05-15: added
`docs/schemas/self-scoping-assessment.schema.json`,
`docs/self-scoping/assessments/repo-scoping-known-bad-2026-05-15-run-39.json`,
`docs/self-scoping/golden/repo-scoping-golden-profile.v1.json`, and
`tests/test_self_scoping_artifacts.py`. The known-bad artifact is marked as a
negative regression seed with `historical_incomplete` release binding because
the original analysis run did not record the engine commit.
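
To make the profile expectations concrete, here is a hedged sketch of what a golden profile and a basic sanity check could look like. The key names (`expected_capabilities`, `forbidden_native_capabilities`, `evidence_paths`), the feature name, and the placeholder source path are assumptions for illustration and are not taken from the `repo-scoping-golden-profile.v1.json` fixture.

```python
GOLDEN_PROFILE_SKETCH = {
    "expected_capabilities": [
        {
            "name": "Deterministic repository scanning into observed facts",
            # Placeholder path: replace with real source refs in a real profile.
            "evidence_paths": ["path/to/scanner_module.py"],
            "features": [
                {
                    "name": "Scan a registered repository",  # hypothetical feature
                    "evidence_paths": ["path/to/scanner_module.py"],
                }
            ],
        }
    ],
    "forbidden_native_capabilities": ["Route LLM Requests Across Providers"],
}


def profile_problems(profile: dict) -> list:
    """Return human-readable problems; an empty list means the sketch checks pass."""
    problems = []
    expected = profile.get("expected_capabilities", [])
    forbidden = {name.lower() for name in profile.get("forbidden_native_capabilities", [])}
    for capability in expected:
        if not capability.get("evidence_paths"):
            problems.append(f"capability without evidence: {capability['name']}")
        if capability["name"].lower() in forbidden:
            problems.append(f"capability is both expected and forbidden: {capability['name']}")
        for feature in capability.get("features", []):
            if not feature.get("evidence_paths"):
                problems.append(f"feature without evidence: {feature['name']}")
    return problems
```

Requiring at least one source path per expectation keeps the profile reviewable: every expected capability or feature points at something a curator can open and verify.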
## T04: Export Assessment Artifacts From Analysis Runs
```task
id: RREG-WP-0013-T04
status: done
priority: high
state_hub_task_id: "51e01d45-7574-4c97-994d-dabb2bcf9a00"
```
Add a CLI and/or API workflow that exports a completed analysis run as a
self-scoping assessment artifact.
Acceptance criteria:
- Export includes repository metadata, analysis run metadata, engine identity,
candidate graph, observed fact summary, content chunk summary, approved map
if present, review decisions, and quality-gate outcomes when available.
- Export format is deterministic JSON with a documented schema.
- Export refuses to mark an artifact comparable when engine identity is
incomplete.
- Export can target repo-scoping itself without requiring network access.

Implementation note 2026-05-15: added
`src/repo_registry/self_scoping/assessment.py` and the
`repo-scoping export-assessment` CLI command. The exporter reads an existing
completed analysis run, records engine identity, generated candidate tree,
approved map, fact/content summaries, review decisions, empty quality-gate
outcomes pending RREG-WP-0014, and known regression patterns. Focused tests cover
the exporter and CLI path.
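
The two export rules that are easiest to get wrong, deterministic output and refusing comparability on incomplete engine identity, can be sketched in a few lines. This is not the code in `assessment.py`; the required field list and the `comparable` and `missing_engine_fields` keys are assumptions for illustration.

```python
import json

# Assumption: the fields this sketch treats as mandatory for release binding.
REQUIRED_ENGINE_FIELDS = ("package_version", "git_commit", "schema_version")


def export_assessment(run: dict) -> str:
    """Serialize a run to deterministic JSON and decide comparability."""
    engine = run.get("engine", {})
    missing = sorted(field for field in REQUIRED_ENGINE_FIELDS if not engine.get(field))
    artifact = dict(run)
    # The exporter, not the caller, decides comparability.
    artifact["comparable"] = not missing
    artifact["missing_engine_fields"] = missing
    # sort_keys plus a fixed indent makes equal inputs produce identical text.
    return json.dumps(artifact, sort_keys=True, indent=2, ensure_ascii=False) + "\n"
```

Because the keys are sorted at every nesting level, two exports of the same data are byte-identical, which keeps artifact diffs in version control meaningful.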
## T05: Compare Baseline And Challenger Runs
```task
id: RREG-WP-0013-T05
status: done
priority: high
state_hub_task_id: "2b71069b-6150-45f4-84a2-59f5ec1e04c0"
```
Implement comparison between an existing baseline and a later challenger run.
Comparison should report:
- Added, removed, renamed, and moved abilities/capabilities/features.
- Hierarchy quality changes, especially misplaced features under the wrong
capability.
- Native-utility precision: whether generated capabilities are repo-owned,
facade/adapter, dependency, tooling, fixture, schema-example, or mention-only.
- Coverage against the repo-scoping golden profile.
- Regression flags for known-bad patterns.
- Source-ref quality: whether claims cite product intent, docs, source, tests,
fixtures, examples, or generated/derived scope.

Acceptance criteria:
- Comparison output is available both as machine-readable JSON and as
human-readable Markdown.
- The report makes it easy to choose "old better", "new better", "tie", or
"needs review".
- It does not require candidates to have stable database IDs across runs.
- It can compare deterministic-only and agent-reviewed runs without losing
provenance.

Implementation note 2026-05-15: added
`src/repo_registry/self_scoping/comparison.py` and the
`repo-scoping compare-assessment` CLI command. The first comparison report
checks assessment artifacts against the repo-scoping golden profile and reports
missing expected capabilities, forbidden native capability matches, known
regression patterns, and misplaced API/CLI features under provider-routing
capabilities. Reports can be emitted as JSON or Markdown.
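
One way to compare hierarchies without stable database IDs is to key every node by its normalized name and record its parent path. The sketch below assumes nested dicts with `name` and `children` keys and is not the logic in `comparison.py`. It has known limits: duplicate names within one tree collide, and a rename shows up as one removal plus one addition.

```python
def index_by_name(nodes: list, parent: tuple = ()) -> dict:
    """Map each normalized node name to the tuple of its ancestors' names."""
    index = {}
    for node in nodes:
        key = " ".join(node["name"].lower().split())
        index[key] = parent
        index.update(index_by_name(node.get("children", []), parent + (key,)))
    return index


def diff_trees(baseline: list, challenger: list) -> dict:
    """Report added, removed, and moved nodes between two hierarchy trees."""
    old, new = index_by_name(baseline), index_by_name(challenger)
    shared = old.keys() & new.keys()
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        # Same name, different ancestry: the node moved in the hierarchy.
        "moved": sorted(name for name in shared if old[name] != new[name]),
    }
```

The `moved` bucket is the one that matters for the motivating failure: a feature that keeps its name but changes parent capability is reported explicitly instead of disappearing into an unchanged count.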
## T06: Add Side-By-Side Review UI
```task
id: RREG-WP-0013-T06
status: done
priority: medium
state_hub_task_id: "16a60b7c-7e2c-4bb0-b4ab-2381289dba0b"
```
Expose baseline/challenger comparison in the curator UI.
Acceptance criteria:
- Reviewers can select two assessment artifacts for repo-scoping.
- The UI shows the two hierarchy trees side by side with moved/misplaced items
highlighted.
- Reviewers can record preference, tie, rejection, and notes.
- Review decisions are persisted as assessment outcomes, not as changes to the
underlying historical artifacts.

Implementation note 2026-05-15: added a file-backed `/ui/self-scoping` curator
surface that reads golden profiles and assessment artifacts from
`docs/self-scoping`, renders side-by-side hierarchy comparisons with regression
highlights, compares two assessment runs directly for old-vs-new judgement, and
records append-only review outcome JSON under `docs/self-scoping/outcomes/`.
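
Append-only outcome recording can be sketched with exclusive file creation, so that an existing outcome is never overwritten. The decision vocabulary, the record fields, and the UUID-based file naming below are assumptions for illustration, not the format the curator UI writes.

```python
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical decision vocabulary for this sketch.
DECISIONS = {"prefer-baseline", "prefer-challenger", "tie", "reject"}


def record_outcome(outcomes_dir, baseline_id, challenger_id, decision, notes=""):
    """Write one review outcome as a new JSON file and return its path."""
    if decision not in DECISIONS:
        raise ValueError(f"unknown decision: {decision!r}")
    outcomes_dir = Path(outcomes_dir)
    outcomes_dir.mkdir(parents=True, exist_ok=True)
    record = {
        "outcome_id": uuid.uuid4().hex,
        "baseline_id": baseline_id,
        "challenger_id": challenger_id,
        "decision": decision,
        "notes": notes,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    path = outcomes_dir / f"{record['outcome_id']}.json"
    # Mode "x" fails if the file already exists, so records are never replaced.
    with open(path, "x", encoding="utf-8") as handle:
        json.dump(record, handle, sort_keys=True, indent=2)
        handle.write("\n")
    return path
```

Each call produces a separate file that references the two assessment artifacts by ID, which leaves the historical artifacts themselves untouched.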
## T07: Add Self-Scoping Regression Command
```task
id: RREG-WP-0013-T07
status: done
priority: medium
state_hub_task_id: "af1fcecd-686d-4592-b739-4698abc98c55"
```
Add a repeatable command for running repo-scoping against itself and comparing
the result to the active baseline.
Acceptance criteria:
- The command captures engine identity before running analysis.
- The command can run deterministic-only without LLM or agentic review.
- The command can optionally invoke agentic review when configured.
- The command emits a comparison report and exits non-zero only for explicit
CI-blocking regressions, not for ordinary "needs review" assessment outcomes.

Implementation note 2026-05-15: added `repo-scoping self-assess`. The command
analyzes a source tree, exports a challenger assessment artifact, compares it to
the golden profile, emits JSON or Markdown, and returns non-zero only with
`--fail-on-regression` when the comparison status is `regression`. The command
defaults to deterministic-only; `--with-llm` opts into configured LLM assistance.
`--agentic-review` now records an agentic-review request and leaves candidates
pending when no agentic reviewer is configured.
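
The exit-code rule can be illustrated with a small argparse sketch. Only the `--fail-on-regression` flag and the `regression` status come from the note above; the function names, the program name, and passing the status in as an argument are simplifications for the example, not the actual CLI wiring.

```python
import argparse


def exit_code_for(status: str, fail_on_regression: bool) -> int:
    """Non-zero only when the caller opted in and the status is a regression."""
    return 1 if (fail_on_regression and status == "regression") else 0


def main(argv: list, status: str) -> int:
    # In a real command the status would come from the comparison report;
    # here it is passed in so the policy can be exercised in isolation.
    parser = argparse.ArgumentParser(prog="self-assess-sketch")
    parser.add_argument("--fail-on-regression", action="store_true")
    args = parser.parse_args(argv)
    return exit_code_for(status, args.fail_on_regression)
```

Keeping the policy in one pure function makes the CI contract easy to test: any status other than `regression`, including ones that still need review, maps to exit code 0.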
## T08: Document Assessment Workflow
```task
id: RREG-WP-0013-T08
status: done
priority: medium
state_hub_task_id: "30d71946-3598-4dc7-9970-c7c18126cad7"
```
Document how maintainers should use self-scoping assessment artifacts while
evolving the engine.
Acceptance criteria:
- Documentation explains baseline, challenger, preferred, tied, rejected, and
superseded outcomes.
- Documentation explains engine release binding and why unbound output is not
comparable.
- Documentation gives examples for the known-bad LLM-provider regression and a
desired native repo-scoping profile.
- Documentation describes when to update the golden profile versus when to fix
the engine.

Implementation note 2026-05-15: added `docs/self-scoping/workflow.md`. The
workflow documents assessment outcomes, release binding, the standard
self-assessment loop, CI use, when to update the golden profile, when to fix the
engine, and the relationship to RREG-WP-0014 agentic acceptance.
## Completion Criteria
- repo-scoping has an immutable, release-bound self-scoping assessment format.
- The current known-bad output is captured as a negative regression seed.
- A curated desired repo-scoping profile exists.
- Maintainers can rerun repo-scoping on itself, compare old/new results, and
record which output is better.
- Comparison results are bound to the repo-scoping release that generated them.