---
id: RREG-WP-0013
type: workplan
title: "Self-Scoping Baseline Evaluation"
domain: capabilities
repo: repo-scoping
status: done
owner: codex
topic_slug: foerster-capabilities
created: "2026-05-15"
updated: "2026-05-15"
state_hub_workstream_id: "1c740db0-1999-478b-b3e3-c0fdfec1e9dd"
---

# Self-Scoping Baseline Evaluation

repo-scoping should become self-improving infrastructure: every meaningful
change to the scoping engine should be testable against a known baseline for
repo-scoping itself. The goal is not just to assert that output changed, but to
make it easy for a human or trusted agent to decide whether an old or new
result is better and to preserve that assessment as signal for future engine
iterations.

The motivating failure is the 2026-05-15 self-analysis, in which deterministic
provider-vocabulary facts were promoted into an approved `Route LLM Requests
Across Providers` capability and the repo's native API/CLI features were
attached under that incorrect capability. Future reruns should make regressions
like that obvious, reviewable, and attributable to the exact repo-scoping
release that generated them.

## T01: Define Self-Scoping Assessment Model

```task
id: RREG-WP-0013-T01
status: done
priority: high
state_hub_task_id: "af633b76-3356-4480-8108-d996eeda5a31"
```

Define the data model for immutable self-scoping assessment runs.

Each assessment must bind together:

- The target repository identity: repo slug, source URL/path, target commit,
  target branch, and dirty-state marker when applicable.
- The engine identity: repo-scoping package version, git commit, git tag or
  release name when available, dirty-state marker, scanner version, candidate
  generator version, quality-gate/ruleset version, schema version, and prompt
  version/hash when LLM or agentic review is used.
- The execution mode: deterministic-only, LLM-assisted, agent-reviewed,
  trusted-auto-review, manual-review, or mixed.
- The generated artifacts: observed fact summary, candidate graph, approved map
  or proposed approval set, rejected/downgraded items, source refs, and review
  notes.
- The assessment outcome: baseline, challenger, preferred, tied, rejected,
  superseded, or needs-human.

Acceptance criteria:
- A documented schema exists for self-scoping assessment runs.
- Assessment runs are append-only; reruns create new records instead of
  rewriting old judgements.
- Engine release binding is required before an assessment can be compared.
- Dirty working trees are visible in the assessment metadata.

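The binding described above could be sketched as frozen dataclasses plus an append-only log. This is a minimal illustration, not the documented schema: the class names, the field names, and the reduced set of engine fields are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(str, Enum):
    # Outcome values taken from the list in this task.
    BASELINE = "baseline"
    CHALLENGER = "challenger"
    PREFERRED = "preferred"
    TIED = "tied"
    REJECTED = "rejected"
    SUPERSEDED = "superseded"
    NEEDS_HUMAN = "needs-human"


@dataclass(frozen=True)
class EngineIdentity:
    # Hypothetical subset of the engine identity fields; None means "not recorded".
    package_version: Optional[str] = None
    git_commit: Optional[str] = None
    dirty: Optional[bool] = None
    schema_version: Optional[str] = None

    def is_complete(self) -> bool:
        values = (self.package_version, self.git_commit, self.dirty, self.schema_version)
        return all(value is not None for value in values)


@dataclass(frozen=True)
class AssessmentRun:
    run_id: str
    repo_slug: str
    target_commit: str
    engine: EngineIdentity
    execution_mode: str
    outcome: Outcome

    @property
    def comparable(self) -> bool:
        # Release binding is required before a run may be compared.
        return self.engine.is_complete()


class AssessmentLog:
    """Append-only collection: reruns add records, existing ones are never replaced."""

    def __init__(self) -> None:
        self._runs: list[AssessmentRun] = []

    def append(self, run: AssessmentRun) -> None:
        if any(existing.run_id == run.run_id for existing in self._runs):
            raise ValueError(f"run {run.run_id!r} already recorded")
        self._runs.append(run)

    def runs(self) -> tuple[AssessmentRun, ...]:
        return tuple(self._runs)
```

Freezing the dataclasses keeps a recorded run from being mutated in place, and the dirty-state marker stays visible as an explicit field rather than being inferred later.
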
## T02: Capture Current Bad Self-Run As A Regression Seed

```task
id: RREG-WP-0013-T02
status: done
priority: high
state_hub_task_id: "98258aea-65bb-4709-921f-711c6cc6ee48"
```

Import or recreate the known-bad repo-scoping self-analysis as a named
regression seed.

Known bad pattern:

- Candidate/approved capability: `Route LLM Requests Across Providers`.
- Incorrect feature attachment: repo-scoping API/CLI surfaces nested under that
  LLM-provider capability.
- Incorrect evidence: scanner vocabulary, schema examples, tests, and
  provider-name normalization code treated as repo-owned LLM routing behavior.

Acceptance criteria:
- The bad run can be inspected as a historical assessment artifact.
- It is clearly marked as a negative baseline, not a desired golden output.
- The failure explanation is stored next to the captured graph.
- Future comparison reports can flag when a challenger repeats the same pattern.

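A check for a challenger repeating this pattern might look like the sketch below. The node shape (`kind`, `name`, `children`) and the word-based API/CLI heuristic are assumptions for illustration; the real candidate graph may be structured differently.

```python
# Capability name taken from the known bad pattern above.
FORBIDDEN_CAPABILITY = "Route LLM Requests Across Providers"
# Hypothetical heuristic: a feature whose name contains one of these words
# is treated as a native repo-scoping surface.
NATIVE_SURFACE_WORDS = ("api", "cli")


def find_known_bad_pattern(nodes):
    """Return one finding per forbidden capability, listing misplaced features."""
    findings = []
    for node in nodes:
        children = node.get("children", [])
        if node.get("kind") == "capability" and node.get("name") == FORBIDDEN_CAPABILITY:
            misplaced = [
                child["name"]
                for child in children
                if child.get("kind") == "feature"
                and any(word in child["name"].lower().split() for word in NATIVE_SURFACE_WORDS)
            ]
            findings.append({"capability": node["name"], "misplaced_features": misplaced})
        # Recurse so the pattern is found at any depth of the hierarchy.
        findings.extend(find_known_bad_pattern(children))
    return findings
```

An empty result means the challenger does not repeat the seed; a non-empty result can be surfaced as a regression flag in the comparison report.
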
## T03: Create Desired Repo-Scoping Golden Profile

```task
id: RREG-WP-0013-T03
status: done
priority: high
state_hub_task_id: "f3ef1711-a115-4368-a97e-98abd1eda521"
```

Author a curated golden profile for repo-scoping itself. This should be compact
enough for comparison but expressive enough to catch hierarchy errors.

Expected native capabilities should cover at least:

- Repository registration and metadata import.
- Deterministic repository scanning into observed facts.
- Source-role and provenance-aware content indexing.
- Candidate characteristic generation from facts and content.
- Candidate review, edit, reject, merge, relink, and approval workflow.
- Approved characteristic search, comparison, export, and capability-gap
  exploration.
- SCOPE.md generation, diffing, validation, and write/update flows.
- Dependency graph and characteristic impact exploration.
- Scope context API support for downstream agents such as activity-core.

Forbidden top-level/native capabilities should include:

- `Route LLM Requests Across Providers`, unless repo-scoping later genuinely
  implements provider routing as a product feature rather than using
  `llm-connect` as optional extraction infrastructure.

Acceptance criteria:
- The golden profile includes ability, capability, feature, and evidence
  expectations with source paths.
- The profile distinguishes native utility from dependencies, fixtures, test
  vocabulary, schema examples, and optional LLM extraction infrastructure.
- The profile is stored in a stable, reviewable fixture location.
- The profile can evolve through explicit assessment decisions.

Implementation note 2026-05-15: added
`docs/schemas/self-scoping-assessment.schema.json`,
`docs/self-scoping/assessments/repo-scoping-known-bad-2026-05-15-run-39.json`,
`docs/self-scoping/golden/repo-scoping-golden-profile.v1.json`, and
`tests/test_self_scoping_artifacts.py`. The known-bad artifact is marked as a
negative regression seed with `historical_incomplete` release binding because
the original analysis run did not record the engine commit.

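To illustrate how a golden profile can drive a coverage check, the sketch below uses a deliberately tiny, hypothetical profile. The key names and the placeholder source paths are assumptions and do not reflect the contents of `repo-scoping-golden-profile.v1.json`.

```python
# Hypothetical, heavily reduced profile; keys and source paths are placeholders.
GOLDEN_PROFILE = {
    "profile_version": "v1",
    "expected_capabilities": [
        {
            "name": "Deterministic repository scanning into observed facts",
            "source_paths": ["<placeholder/scanner/path>"],
        },
        {
            "name": "SCOPE.md generation, diffing, validation, and write/update flows",
            "source_paths": ["<placeholder/scope/path>"],
        },
    ],
    "forbidden_native_capabilities": ["Route LLM Requests Across Providers"],
}


def normalize(name: str) -> str:
    """Compare names case-insensitively and ignore repeated whitespace."""
    return " ".join(name.lower().split())


def check_against_profile(profile: dict, generated_capabilities: list[str]) -> dict:
    generated = {normalize(name) for name in generated_capabilities}
    missing = [
        expected["name"]
        for expected in profile["expected_capabilities"]
        if normalize(expected["name"]) not in generated
    ]
    forbidden = [
        name
        for name in profile["forbidden_native_capabilities"]
        if normalize(name) in generated
    ]
    return {"missing_expected": missing, "forbidden_matches": forbidden}
```

Exact-name matching after normalization is the simplest possible rule; a real comparison would likely need aliases or fuzzy matching so that a renamed but correct capability is not reported as missing.
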
## T04: Export Assessment Artifacts From Analysis Runs

```task
id: RREG-WP-0013-T04
status: done
priority: high
state_hub_task_id: "51e01d45-7574-4c97-994d-dabb2bcf9a00"
```

Add a CLI and/or API workflow that exports a completed analysis run as a
self-scoping assessment artifact.

Acceptance criteria:
- Export includes repository metadata, analysis run metadata, engine identity,
  candidate graph, observed fact summary, content chunk summary, approved map
  if present, review decisions, and quality-gate outcomes when available.
- Export format is deterministic JSON with a documented schema.
- Export refuses to mark an artifact comparable when engine identity is
  incomplete.
- Export can target repo-scoping itself without requiring network access.

Implementation note 2026-05-15: added
`src/repo_registry/self_scoping/assessment.py` and the
`repo-scoping export-assessment` CLI command. The exporter reads an existing
completed analysis run, records engine identity, generated candidate tree,
approved map, fact/content summaries, review decisions, empty quality-gate
outcomes pending RREG-WP-0014, and known regression patterns. Focused tests cover
the exporter and CLI path.

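Two of the criteria above, deterministic JSON and refusing comparability when engine identity is incomplete, can be sketched in a few lines. The function, the required-field list, and the `comparable` / `missing_engine_fields` keys are assumptions for illustration and are not the interface of `assessment.py`.

```python
import json

# Hypothetical subset of engine identity fields required for comparison.
REQUIRED_ENGINE_FIELDS = ("package_version", "git_commit", "dirty", "schema_version")


def export_assessment(run: dict) -> str:
    """Serialize a run so that equal content always yields byte-identical JSON."""
    engine = run.get("engine", {})
    missing = [name for name in REQUIRED_ENGINE_FIELDS if engine.get(name) is None]
    artifact = dict(run)
    # Never mark an artifact comparable when engine identity is incomplete.
    artifact["comparable"] = not missing
    artifact["missing_engine_fields"] = missing
    # sort_keys plus fixed indentation removes dependence on dict insertion order.
    return json.dumps(artifact, sort_keys=True, indent=2, ensure_ascii=False) + "\n"
```

Sorting keys only fixes the ordering of objects; lists such as candidate children would also need a stable order before export for the output to be fully deterministic.
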
## T05: Compare Baseline And Challenger Runs

```task
id: RREG-WP-0013-T05
status: done
priority: high
state_hub_task_id: "2b71069b-6150-45f4-84a2-59f5ec1e04c0"
```

Implement comparison between an existing baseline and a later challenger run.

Comparison should report:

- Added, removed, renamed, and moved abilities/capabilities/features.
- Hierarchy quality changes, especially misplaced features under the wrong
  capability.
- Native-utility precision: whether generated capabilities are repo-owned,
  facade/adapter, dependency, tooling, fixture, schema-example, or mention-only.
- Coverage against the repo-scoping golden profile.
- Regression flags for known-bad patterns.
- Source-ref quality: whether claims cite product intent, docs, source, tests,
  fixtures, examples, or generated/derived scope.

Acceptance criteria:
- Comparison output is useful in both machine-readable JSON and human-readable
  Markdown.
- The report makes it easy to choose "old better", "new better", "tie", or
  "needs review".
- It does not require candidates to have stable database IDs across runs.
- It can compare deterministic-only and agent-reviewed runs without losing
  provenance.

Implementation note 2026-05-15: added
`src/repo_registry/self_scoping/comparison.py` and the
`repo-scoping compare-assessment` CLI command. The first comparison report
checks assessment artifacts against the repo-scoping golden profile, reports
missing expected capabilities, forbidden native capability matches, known
regression patterns, and misplaced API/CLI features under provider-routing
capabilities. Reports can be emitted as JSON or Markdown.

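One way to compare hierarchies without stable database IDs is to key every node by its name and record its parent path. The sketch below assumes names are unique within a tree and uses a hypothetical `name` / `children` node shape; it covers added, removed, and moved items only, not renames.

```python
def flatten(nodes, parent=()):
    """Map each node name to the tuple of ancestor names above it."""
    index = {}
    for node in nodes:
        index[node["name"]] = parent
        index.update(flatten(node.get("children", []), parent + (node["name"],)))
    return index


def compare_trees(baseline, challenger):
    old, new = flatten(baseline), flatten(challenger)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        # Present in both runs but attached under a different parent path.
        "moved": sorted(name for name in old.keys() & new.keys() if old[name] != new[name]),
    }
```

A feature that leaves a provider-routing capability and reappears under a native one shows up in `moved`, which is the signal a reviewer needs when judging whether the new hierarchy is better.
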
## T06: Add Side-By-Side Review UI

```task
id: RREG-WP-0013-T06
status: done
priority: medium
state_hub_task_id: "16a60b7c-7e2c-4bb0-b4ab-2381289dba0b"
```

Expose baseline/challenger comparison in the curator UI.

Acceptance criteria:
- Reviewers can select two assessment artifacts for repo-scoping.
- The UI shows the two hierarchy trees side by side with moved/misplaced items
  highlighted.
- Reviewers can record preference, tie, rejection, and notes.
- Review decisions are persisted as assessment outcomes, not as changes to the
  underlying historical artifacts.

Implementation note 2026-05-15: added a file-backed `/ui/self-scoping` curator
surface that reads golden profiles and assessment artifacts from
`docs/self-scoping`, renders side-by-side hierarchy comparisons with regression
highlights, compares two assessment runs directly for old-vs-new judgement, and
records append-only review outcome JSON under `docs/self-scoping/outcomes/`.

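Append-only outcome persistence can be approximated by writing each decision to a new file opened in exclusive-create mode. The file naming scheme, the decision labels, and the record keys below are assumptions for illustration, not the format the curator UI writes.

```python
import json
from pathlib import Path

# Hypothetical decision labels covering preference, tie, and rejection.
ALLOWED_DECISIONS = {"baseline-preferred", "challenger-preferred", "tied", "rejected"}


def record_outcome(outcomes_dir, baseline_id, challenger_id, decision, notes=""):
    """Write one review outcome to a new file; existing outcomes are never touched."""
    if decision not in ALLOWED_DECISIONS:
        raise ValueError(f"unknown decision: {decision!r}")
    directory = Path(outcomes_dir)
    directory.mkdir(parents=True, exist_ok=True)
    sequence = len(list(directory.glob("outcome-*.json"))) + 1
    path = directory / f"outcome-{sequence:04d}.json"
    record = {
        "baseline": baseline_id,
        "challenger": challenger_id,
        "decision": decision,
        "notes": notes,
    }
    # Mode "x" raises FileExistsError instead of overwriting an earlier outcome.
    with path.open("x", encoding="utf-8") as handle:
        json.dump(record, handle, sort_keys=True, indent=2)
    return path
```

The outcome refers to the assessment artifacts by identifier only, so recording a judgement never modifies the historical artifacts themselves.
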
## T07: Add Self-Scoping Regression Command

```task
id: RREG-WP-0013-T07
status: done
priority: medium
state_hub_task_id: "af1fcecd-686d-4592-b739-4698abc98c55"
```

Add a repeatable command for running repo-scoping against itself and comparing
the result to the active baseline.

Acceptance criteria:
- The command captures engine identity before running analysis.
- The command can run deterministic-only without LLM or agentic review.
- The command can optionally invoke agentic review when configured.
- The command emits a comparison report and exits non-zero only for explicit
  CI-blocking regressions, not for ordinary "needs review" assessment outcomes.

Implementation note 2026-05-15: added `repo-scoping self-assess`. The command
analyzes a source tree, exports a challenger assessment artifact, compares it to
the golden profile, emits JSON or Markdown, and returns non-zero only with
`--fail-on-regression` when the comparison status is `regression`. The command
defaults to deterministic-only; `--with-llm` opts into configured LLM assistance.
`--agentic-review` now records an agentic-review request and leaves candidates
pending when no agentic reviewer is configured.

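The exit-code rule in the note above is small enough to sketch directly. The flag names and the `regression` status come from the note; the parser, the helper functions, and any other status value are assumptions for illustration and are not the actual `self-assess` implementation.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Stand-in parser for illustration only; not the real repo-scoping CLI.
    parser = argparse.ArgumentParser(prog="self-assess-sketch")
    parser.add_argument("--fail-on-regression", action="store_true")
    parser.add_argument("--with-llm", action="store_true")
    return parser


def execution_mode(with_llm: bool) -> str:
    """Deterministic-only is the default; LLM assistance is opt-in."""
    return "LLM-assisted" if with_llm else "deterministic-only"


def resolve_exit_code(comparison_status: str, fail_on_regression: bool) -> int:
    """Non-zero only when the flag is set and the status is exactly 'regression'."""
    return 1 if fail_on_regression and comparison_status == "regression" else 0
```

Keeping the blocking condition this narrow means an ordinary "needs review" result still produces a report in CI without failing the build.
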
## T08: Document Assessment Workflow

```task
id: RREG-WP-0013-T08
status: done
priority: medium
state_hub_task_id: "30d71946-3598-4dc7-9970-c7c18126cad7"
```

Document how maintainers should use self-scoping assessment artifacts while
evolving the engine.

Acceptance criteria:
- Documentation explains baseline, challenger, preferred, tied, rejected, and
  superseded outcomes.
- Documentation explains engine release binding and why unbound output is not
  comparable.
- Documentation gives examples for the known-bad LLM-provider regression and a
  desired native repo-scoping profile.
- Documentation describes when to update the golden profile versus when to fix
  the engine.

Implementation note 2026-05-15: added `docs/self-scoping/workflow.md`. The
workflow documents assessment outcomes, release binding, the standard
self-assessment loop, CI use, when to update the golden profile, when to fix the
engine, and the relationship to RREG-WP-0014 agentic acceptance.

## Completion Criteria

- repo-scoping has an immutable, release-bound self-scoping assessment format.
- The current known-bad output is captured as a negative regression seed.
- A curated desired repo-scoping profile exists.
- Maintainers can rerun repo-scoping on itself, compare old/new results, and
  record which output is better.
- Comparison results are bound to the repo-scoping release that generated them.