repo-scoping/workplans/RREG-WP-0013-self-scoping-baseline-evaluation.md

id: RREG-WP-0013
type: workplan
title: Self-Scoping Baseline Evaluation
domain: capabilities
repo: repo-scoping
status: done
owner: codex
topic_slug: foerster-capabilities
created: 2026-05-15
updated: 2026-05-15
state_hub_workstream_id: 1c740db0-1999-478b-b3e3-c0fdfec1e9dd

Self-Scoping Baseline Evaluation

repo-scoping should become self-improving infrastructure: every meaningful change to the scoping engine should be testable against a known baseline for repo-scoping itself. The goal is not just to assert that the output changed, but to make it easy for a human or trusted agent to decide whether the old or the new result is better, and to preserve that assessment as signal for future engine iterations.

The motivating failure is the 2026-05-15 self-analysis, in which deterministic provider-vocabulary facts were promoted into an approved Route LLM Requests Across Providers capability and the repo's native API/CLI features were attached under that incorrect capability. Future reruns should make regressions like that obvious, reviewable, and attributable to the exact repo-scoping release that generated them.

T01: Define Self-Scoping Assessment Model

id: RREG-WP-0013-T01
status: done
priority: high
state_hub_task_id: "af633b76-3356-4480-8108-d996eeda5a31"

Define the data model for immutable self-scoping assessment runs.

Each assessment must bind together:

  • The target repository identity: repo slug, source URL/path, target commit, target branch, and dirty-state marker when applicable.
  • The engine identity: repo-scoping package version, git commit, git tag or release name when available, dirty-state marker, scanner version, candidate generator version, quality-gate/ruleset version, schema version, and prompt version/hash when LLM or agentic review is used.
  • The execution mode: deterministic-only, LLM-assisted, agent-reviewed, trusted-auto-review, manual-review, or mixed.
  • The generated artifacts: observed fact summary, candidate graph, approved map or proposed approval set, rejected/downgraded items, source refs, and review notes.
  • The assessment outcome: baseline, challenger, preferred, tied, rejected, superseded, or needs-human.

Acceptance criteria:

  • A documented schema exists for self-scoping assessment runs.
  • Assessment runs are append-only; reruns create new records instead of rewriting old judgements.
  • Engine release binding is required before an assessment can be compared.
  • Dirty working trees are visible in the assessment metadata.
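
The bindings listed above could be modelled roughly as follows. This is a minimal sketch, not the documented schema: every class and field name is a hypothetical stand-in chosen for illustration, and the rule for which engine fields count as "complete" is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class ExecutionMode(str, Enum):
    DETERMINISTIC_ONLY = "deterministic-only"
    LLM_ASSISTED = "llm-assisted"
    AGENT_REVIEWED = "agent-reviewed"
    TRUSTED_AUTO_REVIEW = "trusted-auto-review"
    MANUAL_REVIEW = "manual-review"
    MIXED = "mixed"


class Outcome(str, Enum):
    BASELINE = "baseline"
    CHALLENGER = "challenger"
    PREFERRED = "preferred"
    TIED = "tied"
    REJECTED = "rejected"
    SUPERSEDED = "superseded"
    NEEDS_HUMAN = "needs-human"


@dataclass(frozen=True)
class TargetIdentity:
    repo_slug: str
    source: str          # source URL or local path
    commit: str
    branch: str
    dirty: bool = False  # dirty-state marker stays visible in metadata


@dataclass(frozen=True)
class EngineIdentity:
    package_version: Optional[str]
    git_commit: Optional[str]
    scanner_version: Optional[str]
    candidate_generator_version: Optional[str]
    ruleset_version: Optional[str]
    schema_version: Optional[str]
    release: Optional[str] = None      # git tag or release name, when available
    prompt_hash: Optional[str] = None  # expected for LLM/agentic runs; not enforced here
    dirty: bool = False

    def is_complete(self) -> bool:
        # Assumption: these six fields are the minimum release binding.
        required = (
            self.package_version, self.git_commit, self.scanner_version,
            self.candidate_generator_version, self.ruleset_version,
            self.schema_version,
        )
        return all(required)


@dataclass(frozen=True)  # frozen: a rerun creates a new record, never edits one
class AssessmentRun:
    run_id: str
    target: TargetIdentity
    engine: EngineIdentity
    mode: ExecutionMode
    outcome: Outcome
    artifact_paths: Tuple[str, ...] = ()

    @property
    def comparable(self) -> bool:
        # Engine release binding is required before a run can be compared.
        return self.engine.is_complete()
```

Freezing the dataclasses only protects in-memory records; append-only storage still has to be enforced where the artifacts are written.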

T02: Capture Current Bad Self-Run As A Regression Seed

id: RREG-WP-0013-T02
status: done
priority: high
state_hub_task_id: "98258aea-65bb-4709-921f-711c6cc6ee48"

Import or recreate the known-bad repo-scoping self-analysis as a named regression seed.

Known bad pattern:

  • Candidate/approved capability: Route LLM Requests Across Providers.
  • Incorrect feature attachment: repo-scoping API/CLI surfaces nested under that LLM-provider capability.
  • Incorrect evidence: scanner vocabulary, schema examples, tests, and provider-name normalization code treated as repo-owned LLM routing behavior.

Acceptance criteria:

  • The bad run can be inspected as a historical assessment artifact.
  • It is clearly marked as a negative baseline, not a desired golden output.
  • The failure explanation is stored next to the captured graph.
  • Future comparison reports can flag when a challenger repeats the same pattern.
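
A check for the known-bad pattern might look like the sketch below. The graph shape (a list of capability dicts with `name`, `features`, and a per-feature `surface` key) is assumed for illustration and is not taken from the captured artifact.

```python
from typing import Dict, List

# Name of the capability that the 2026-05-15 self-run wrongly approved.
FORBIDDEN_CAPABILITY = "route llm requests across providers"
# Surfaces that belong to repo-scoping itself (assumed labels).
NATIVE_SURFACES = {"api", "cli"}


def normalize(name: str) -> str:
    """Case- and whitespace-insensitive form used for matching names."""
    return " ".join(name.lower().split())


def find_known_bad_pattern(capabilities: List[Dict]) -> List[str]:
    """Return one flag per occurrence of the known-bad pattern."""
    flags: List[str] = []
    for cap in capabilities:
        if normalize(cap["name"]) != FORBIDDEN_CAPABILITY:
            continue
        flags.append(f"forbidden capability present: {cap['name']}")
        for feature in cap.get("features", []):
            if feature.get("surface") in NATIVE_SURFACES:
                flags.append(
                    f"native {feature['surface']} feature misplaced: {feature['name']}"
                )
    return flags
```

An empty result means the challenger does not repeat this particular pattern; it says nothing about other hierarchy errors.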

T03: Create Desired Repo-Scoping Golden Profile

id: RREG-WP-0013-T03
status: done
priority: high
state_hub_task_id: "f3ef1711-a115-4368-a97e-98abd1eda521"

Author a curated golden profile for repo-scoping itself. This should be compact enough for comparison but expressive enough to catch hierarchy errors.

Expected native capabilities should cover at least:

  • Repository registration and metadata import.
  • Deterministic repository scanning into observed facts.
  • Source-role and provenance-aware content indexing.
  • Candidate characteristic generation from facts and content.
  • Candidate review, edit, reject, merge, relink, and approval workflow.
  • Approved characteristic search, comparison, export, and capability-gap exploration.
  • SCOPE.md generation, diffing, validation, and write/update flows.
  • Dependency graph and characteristic impact exploration.
  • Scope context API support for downstream agents such as activity-core.

Forbidden top-level/native capabilities should include:

  • Route LLM Requests Across Providers, unless repo-scoping later genuinely implements provider routing as a product feature rather than using llm-connect as optional extraction infrastructure.

Acceptance criteria:

  • The golden profile includes ability, capability, feature, and evidence expectations with source paths.
  • The profile distinguishes native utility from dependencies, fixtures, test vocabulary, schema examples, and optional LLM extraction infrastructure.
  • The profile is stored in a stable, reviewable fixture location.
  • The profile can evolve through explicit assessment decisions.

Implementation note 2026-05-15: added docs/schemas/self-scoping-assessment.schema.json, docs/self-scoping/assessments/repo-scoping-known-bad-2026-05-15-run-39.json, docs/self-scoping/golden/repo-scoping-golden-profile.v1.json, and tests/test_self_scoping_artifacts.py. The known-bad artifact is marked as a negative regression seed with historical_incomplete release binding because the original analysis run did not record the engine commit.
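
To illustrate how a profile can separate native utility from tests, fixtures, and schema examples, the sketch below classifies evidence paths by role and only counts native sources as support for an expected capability. The path prefixes, role labels, and the `expected`/`forbidden` profile keys are assumptions, not the layout of repo-scoping-golden-profile.v1.json.

```python
from pathlib import PurePosixPath
from typing import Dict, List

# Assumed mapping from path prefix to evidence role; most specific first.
ROLE_BY_PREFIX = (
    (("tests", "fixtures"), "fixture"),
    (("tests",), "test"),
    (("docs", "schemas"), "schema-example"),
    (("docs",), "docs"),
    (("src",), "native"),
)


def source_role(path: str) -> str:
    """Classify a repo-relative path by the role its evidence plays."""
    parts = PurePosixPath(path).parts
    for prefix, role in ROLE_BY_PREFIX:
        if parts[: len(prefix)] == prefix:
            return role
    return "other"


def check_profile(profile: Dict[str, List[str]],
                  evidence: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Compare generated capabilities (name -> evidence paths) to a profile."""
    missing, non_native_only = [], []
    for name in profile["expected"]:
        paths = evidence.get(name)
        if paths is None:
            missing.append(name)
        elif not any(source_role(p) == "native" for p in paths):
            # Present, but supported only by tests, fixtures, docs, or examples.
            non_native_only.append(name)
    forbidden = [name for name in profile["forbidden"] if name in evidence]
    return {"missing": missing, "non_native_only": non_native_only,
            "forbidden": forbidden}
```

Matching here is by exact capability name; a real check would likely need a tolerant matcher so that harmless renames do not show up as missing coverage.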

T04: Export Assessment Artifacts From Analysis Runs

id: RREG-WP-0013-T04
status: done
priority: high
state_hub_task_id: "51e01d45-7574-4c97-994d-dabb2bcf9a00"

Add a CLI and/or API workflow that exports a completed analysis run as a self-scoping assessment artifact.

Acceptance criteria:

  • Export includes repository metadata, analysis run metadata, engine identity, candidate graph, observed fact summary, content chunk summary, approved map if present, review decisions, and quality-gate outcomes when available.
  • Export format is deterministic JSON with a documented schema.
  • Export refuses to mark an artifact comparable when engine identity is incomplete.
  • Export can target repo-scoping itself without requiring network access.

Implementation note 2026-05-15: added src/repo_registry/self_scoping/assessment.py and the repo-scoping export-assessment CLI command. The exporter reads an existing completed analysis run, records engine identity, generated candidate tree, approved map, fact/content summaries, review decisions, empty quality-gate outcomes pending RREG-WP-0014, and known regression patterns. Focused tests cover the exporter and CLI path.
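
The two export criteria that are easiest to get wrong, deterministic JSON and refusing to mark unbound output comparable, can be sketched as below. This is not the code in assessment.py; the required field names and the `comparable` / `missing_engine_fields` keys are assumptions.

```python
import json
from typing import Any, Dict

# Assumed minimum engine identity for a comparable artifact.
REQUIRED_ENGINE_FIELDS = ("package_version", "git_commit", "schema_version")


def export_assessment(run: Dict[str, Any]) -> str:
    """Serialize a run so that equal content always yields equal bytes."""
    engine = run.get("engine", {})
    missing = sorted(f for f in REQUIRED_ENGINE_FIELDS if not engine.get(f))
    artifact = dict(run)  # shallow copy; the input record is left untouched
    artifact["comparable"] = not missing
    artifact["missing_engine_fields"] = missing
    # sort_keys plus fixed indentation removes dict-ordering noise from diffs.
    return json.dumps(artifact, sort_keys=True, indent=2, ensure_ascii=False) + "\n"
```

Sorting keys only makes mappings deterministic; any lists in the artifact (candidates, source refs) would also need a stable order before serialization.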

T05: Compare Baseline And Challenger Runs

id: RREG-WP-0013-T05
status: done
priority: high
state_hub_task_id: "2b71069b-6150-45f4-84a2-59f5ec1e04c0"

Implement comparison between an existing baseline and a later challenger run.

Comparison should report:

  • Added, removed, renamed, and moved abilities/capabilities/features.
  • Hierarchy quality changes, especially misplaced features under the wrong capability.
  • Native-utility precision: whether generated capabilities are repo-owned, facade/adapter, dependency, tooling, fixture, schema-example, or mention-only.
  • Coverage against the repo-scoping golden profile.
  • Regression flags for known-bad patterns.
  • Source-ref quality: whether claims cite product intent, docs, source, tests, fixtures, examples, or generated/derived scope.

Acceptance criteria:

  • Comparison output is useful in both machine-readable JSON and human-readable Markdown.
  • The report makes it easy to choose "old better", "new better", "tie", or "needs review".
  • It does not require candidates to have stable database IDs across runs.
  • It can compare deterministic-only and agent-reviewed runs without losing provenance.

Implementation note 2026-05-15: added src/repo_registry/self_scoping/comparison.py and the repo-scoping compare-assessment CLI command. The first comparison report checks assessment artifacts against the repo-scoping golden profile and reports missing expected capabilities, forbidden native capability matches, known regression patterns, and misplaced API/CLI features under provider-routing capabilities. Reports can be emitted as JSON or Markdown.
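
One way to compare runs without stable database IDs is to key each feature by its normalized name and record its parent capability, as in the sketch below. The input shape (capability dicts with a `name` and a list of feature names) is assumed, and the sketch does not attempt rename detection, which would need fuzzy matching.

```python
from typing import Dict, List


def normalize(name: str) -> str:
    """Case- and whitespace-insensitive key that survives across runs."""
    return " ".join(name.casefold().split())


def index_features(capabilities: List[Dict]) -> Dict[str, str]:
    """Map each normalized feature name to its normalized parent capability."""
    index: Dict[str, str] = {}
    for cap in capabilities:
        for feature in cap.get("features", []):
            index[normalize(feature)] = normalize(cap["name"])
    return index


def compare(baseline: List[Dict], challenger: List[Dict]) -> Dict[str, List[str]]:
    """Report features added, removed, or moved to a different capability."""
    old, new = index_features(baseline), index_features(challenger)
    shared = old.keys() & new.keys()
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "moved": sorted(k for k in shared if old[k] != new[k]),
    }
```

A feature listed under "moved" is exactly the case the motivating failure needs surfaced: the same API/CLI feature appearing under a different parent capability.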

T06: Add Side-By-Side Review UI

id: RREG-WP-0013-T06
status: done
priority: medium
state_hub_task_id: "16a60b7c-7e2c-4bb0-b4ab-2381289dba0b"

Expose baseline/challenger comparison in the curator UI.

Acceptance criteria:

  • Reviewers can select two assessment artifacts for repo-scoping.
  • The UI shows the two hierarchy trees side by side with moved/misplaced items highlighted.
  • Reviewers can record preference, tie, rejection, and notes.
  • Review decisions are persisted as assessment outcomes, not as changes to the underlying historical artifacts.

Implementation note 2026-05-15: added a file-backed /ui/self-scoping curator surface that reads golden profiles and assessment artifacts from docs/self-scoping, renders side-by-side hierarchy comparisons with regression highlights, compares two assessment runs directly for old-vs-new judgement, and records append-only review outcome JSON under docs/self-scoping/outcomes/.
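
Append-only outcome recording can be approximated by always writing a new file and opening it in exclusive-create mode, so an existing decision can never be overwritten. The sketch below is hypothetical: the file naming, record fields, and decision labels are assumptions rather than the format used under docs/self-scoping/outcomes/.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Assumed decision labels for illustration.
ALLOWED_DECISIONS = {"old-better", "new-better", "tie", "rejected"}


def record_outcome(outcomes_dir: Path, baseline_id: str, challenger_id: str,
                   decision: str, notes: str = "") -> Path:
    """Write one review decision as a new JSON file and return its path."""
    if decision not in ALLOWED_DECISIONS:
        raise ValueError(f"unknown decision: {decision!r}")
    outcomes_dir.mkdir(parents=True, exist_ok=True)
    # Next sequence number; the historical artifacts themselves are never touched.
    seq = len(list(outcomes_dir.glob("outcome-*.json"))) + 1
    path = outcomes_dir / f"outcome-{seq:04d}.json"
    record = {
        "baseline": baseline_id,
        "challenger": challenger_id,
        "decision": decision,
        "notes": notes,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    # Mode "x" raises FileExistsError instead of overwriting an older record.
    with path.open("x", encoding="utf-8") as fh:
        json.dump(record, fh, sort_keys=True, indent=2)
    return path
```

Counting existing files is not safe under concurrent writers; the exclusive-create mode turns such a race into an error rather than silent data loss.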

T07: Add Self-Scoping Regression Command

id: RREG-WP-0013-T07
status: done
priority: medium
state_hub_task_id: "af1fcecd-686d-4592-b739-4698abc98c55"

Add a repeatable command for running repo-scoping against itself and comparing the result to the active baseline.

Acceptance criteria:

  • The command captures engine identity before running analysis.
  • The command can run deterministic-only without LLM or agentic review.
  • The command can optionally invoke agentic review when configured.
  • The command emits a comparison report and exits non-zero only for explicit CI-blocking regressions, not for ordinary "needs review" assessment outcomes.

Implementation note 2026-05-15: added repo-scoping self-assess. The command analyzes a source tree, exports a challenger assessment artifact, compares it to the golden profile, emits JSON or Markdown, and returns non-zero only with --fail-on-regression when the comparison status is regression. The command defaults to deterministic-only; --with-llm opts into configured LLM assistance. --agentic-review now records an agentic-review request and leaves candidates pending when no agentic reviewer is configured.
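
The exit-code rule described above is small enough to state as code. The sketch below uses the three flags named in the note; the parser is a stand-in for illustration, not the real CLI definition, and every status value other than `regression` is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Stand-in parser covering only the flags named in the note above."""
    parser = argparse.ArgumentParser(prog="repo-scoping self-assess")
    parser.add_argument("--fail-on-regression", action="store_true",
                        help="exit non-zero when the comparison status is regression")
    parser.add_argument("--with-llm", action="store_true",
                        help="opt into configured LLM assistance")
    parser.add_argument("--agentic-review", action="store_true",
                        help="record an agentic-review request")
    return parser


def exit_code(status: str, fail_on_regression: bool) -> int:
    """Non-zero only for an explicit regression when the flag is set."""
    return 1 if (fail_on_regression and status == "regression") else 0
```

Keeping "needs review" at exit code 0 means CI only blocks on regressions the comparison is sure about, while ambiguous results still reach a reviewer through the emitted report.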

T08: Document Assessment Workflow

id: RREG-WP-0013-T08
status: done
priority: medium
state_hub_task_id: "30d71946-3598-4dc7-9970-c7c18126cad7"

Document how maintainers should use self-scoping assessment artifacts while evolving the engine.

Acceptance criteria:

  • Documentation explains baseline, challenger, preferred, tied, rejected, and superseded outcomes.
  • Documentation explains engine release binding and why unbound output is not comparable.
  • Documentation gives examples for the known-bad LLM-provider regression and a desired native repo-scoping profile.
  • Documentation describes when to update the golden profile versus when to fix the engine.

Implementation note 2026-05-15: added docs/self-scoping/workflow.md. The workflow documents assessment outcomes, release binding, the standard self-assessment loop, CI use, when to update the golden profile, when to fix the engine, and the relationship to RREG-WP-0014 agentic acceptance.

Completion Criteria

  • repo-scoping has an immutable, release-bound self-scoping assessment format.
  • The current known-bad output is captured as a negative regression seed.
  • A curated desired repo-scoping profile exists.
  • Maintainers can rerun repo-scoping on itself, compare old/new results, and record which output is better.
  • Comparison results are bound to the repo-scoping release that generated them.