Self-Scoping Assessment Workflow

Self-scoping is the feedback loop for improving repo-scoping with evidence. The loop is simple: run the current engine against repo-scoping itself, compare the result to a curated golden profile and known bad runs, then record whether the new result is better.

Outcome Terms

baseline: a result accepted as a reference point for later comparisons.
challenger: a fresh result from a new engine version or configuration.
preferred: the reviewer chose this result over the prior baseline.
tied: the reviewer judged old and new results roughly equivalent.
rejected: the result is known bad and should not become baseline truth.
superseded: the result used to be useful but was replaced by a newer preferred assessment.
needs-human: the result cannot be judged confidently without curator review.
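
The outcome terms above can be modeled as a small closed vocabulary. The following is a minimal sketch, not the project's actual data model: the `Outcome` class, its string values, and the `can_replace_baseline` helper are hypothetical names. The rule that only `preferred` may replace the baseline is a conservative assumption drawn from the definitions, not a documented policy.

```python
from enum import Enum


class Outcome(Enum):
    """Assessment outcome terms; the string spellings are assumed."""

    BASELINE = "baseline"
    CHALLENGER = "challenger"
    PREFERRED = "preferred"
    TIED = "tied"
    REJECTED = "rejected"
    SUPERSEDED = "superseded"
    NEEDS_HUMAN = "needs-human"


def can_replace_baseline(outcome: Outcome) -> bool:
    """Assumption: only a result the reviewer preferred may become the new baseline.

    A rejected result must never become baseline truth; the other outcomes
    are treated conservatively as non-promoting in this sketch.
    """
    return outcome is Outcome.PREFERRED
```

For example, `can_replace_baseline(Outcome("rejected"))` returns `False`, which matches the rule that a known-bad result should not become baseline truth.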

The known run 39 artifact from 2026-05-15 has the outcome rejected. It serves as a negative regression seed, not as a baseline to imitate.

Release Binding

Assessment output is only useful if it is bound to the engine that generated it. Comparable challenger artifacts should record:

repo-scoping package version
engine git commit
engine release or tag when available
engine dirty state
scanner version
candidate generator version
quality criteria version
prompt version when LLM or agentic review is used

An artifact with release_binding_status=complete can be compared as a real challenger. An artifact with release_binding_status=historical_incomplete can still be useful as a negative seed, but it should not become a preferred baseline. An unbound artifact is diagnostic only.

Dirty state does not automatically make an artifact useless, but it must be visible. A dirty challenger should usually be rerun after the relevant changes are committed.
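
The binding rules above can be sketched as a record plus two small checks. This is an illustration under stated assumptions: the `ReleaseBinding` field names, the `allowed_use` return labels, and the helper names are hypothetical, and any status other than `complete` or `historical_incomplete` is treated as unbound.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass(frozen=True)
class ReleaseBinding:
    """Hypothetical field names mirroring the list of recorded values."""

    package_version: Optional[str] = None
    engine_git_commit: Optional[str] = None
    engine_release: Optional[str] = None  # optional: recorded when available
    engine_dirty: Optional[bool] = None
    scanner_version: Optional[str] = None
    candidate_generator_version: Optional[str] = None
    quality_criteria_version: Optional[str] = None
    prompt_version: Optional[str] = None  # only for LLM or agentic review


# Fields that this sketch does not require for a complete binding.
OPTIONAL_FIELDS = {"engine_release", "prompt_version"}


def missing_fields(binding: ReleaseBinding) -> List[str]:
    """Return the names of required fields that were not recorded."""
    return [
        f.name
        for f in fields(binding)
        if f.name not in OPTIONAL_FIELDS and getattr(binding, f.name) is None
    ]


def allowed_use(status: str) -> str:
    """Map a release_binding_status value to what the artifact may be used for."""
    if status == "complete":
        return "challenger"
    if status == "historical_incomplete":
        return "negative_seed"
    return "diagnostic_only"  # assumption: everything else counts as unbound


def rerun_recommended(binding: ReleaseBinding) -> bool:
    """A dirty challenger should usually be rerun after changes are committed."""
    return binding.engine_dirty is True
```

Keeping the dirty flag as an explicit field, rather than dropping dirty artifacts, follows the point that dirty state must be visible but does not make an artifact useless.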

Standard Loop

Run the self-assessment command:

repo-scoping self-assess \
  --source-path . \
  --assessment-output docs/self-scoping/assessments/repo-scoping-challenger.json \
  --comparison-output docs/self-scoping/assessments/repo-scoping-challenger.md

Read the comparison report.
If the report says regression, inspect forbidden capabilities, misplaced features, and known regression patterns first.
If the report says needs_review, inspect missing expected capabilities and source evidence before choosing old or new output.
If the report says candidate_improvement, still confirm that the hierarchy, source refs, and native-utility boundaries make sense.
Record the decision as an assessment outcome before changing the active baseline.
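
The review order in the steps above can be captured as a lookup from report verdict to the items to inspect first. This is a sketch only: the `FIRST_CHECKS` table and the `first_checks` helper are hypothetical, and the verdict strings are taken from the steps as written.

```python
from typing import Dict, List

# Verdict -> what to inspect first, in the order the steps list them.
FIRST_CHECKS: Dict[str, List[str]] = {
    "regression": [
        "forbidden capabilities",
        "misplaced features",
        "known regression patterns",
    ],
    "needs_review": [
        "missing expected capabilities",
        "source evidence",
    ],
    "candidate_improvement": [
        "hierarchy",
        "source refs",
        "native-utility boundaries",
    ],
}


def first_checks(verdict: str) -> List[str]:
    """Return the review checklist for a verdict, or raise on an unknown one."""
    try:
        return list(FIRST_CHECKS[verdict])
    except KeyError:
        raise ValueError(f"unknown verdict: {verdict!r}") from None
```

Note that even `candidate_improvement` has a checklist: an apparent improvement is still confirmed before the outcome is recorded.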

CI Use

Use --fail-on-regression only when a regression should fail the command, for example in a blocking CI job:

repo-scoping self-assess \
  --source-path . \
  --comparison-output /tmp/repo-scoping-self-assessment.md \
  --fail-on-regression

The command should not fail for ordinary needs_review results. Review-needed output is signal, not a broken build.
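
The intended failure behavior can be summarized in a few lines. This is a sketch of the policy, not the command's actual implementation: the `exit_code` function is hypothetical, and the use of exit status 1 for a blocking regression is an assumption.

```python
def exit_code(verdict: str, fail_on_regression: bool) -> int:
    """Return a process exit status for a comparison verdict.

    Only a regression fails, and only when the caller opted in.
    needs_review and candidate_improvement always return 0.
    """
    if fail_on_regression and verdict == "regression":
        return 1  # assumed nonzero status for a blocking regression
    return 0
```

Under this policy, `exit_code("needs_review", fail_on_regression=True)` is `0`, so review-needed output never breaks the build.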

Updating The Golden Profile

Update golden/repo-scoping-golden-profile.v1.json when the repository's real product utility has changed. Examples:

repo-scoping adds a genuinely new user-facing capability.
a capability is renamed after curator agreement.
a former out-of-scope behavior becomes product intent and has supporting implementation evidence.

Do not update the golden profile just because the engine failed to find an expected capability. That is usually an engine issue.

Fixing The Engine

Fix the engine when a challenger:

repeats a known regression pattern
promotes dependency, fixture, schema, scanner-rule, or workplan vocabulary as native capability truth
places features under a capability they do not support
loses source refs or cites evidence that does not support the abstraction
relies on generated SCOPE.md as primary proof for rebuilding the same model

The 2026-05-15 run 39 failure is the canonical example: provider vocabulary from scanner code, tests, fixtures, and schema examples became the false native capability Route LLM Requests Across Providers. The correct action is to fix scanner/generator/acceptance behavior, not to teach the golden profile that repo-scoping is an LLM router.

Relationship To Agentic Acceptance

Deterministic assessment can reject, downgrade, or flag output with transparent criteria. It should not approve candidate characteristics as registry truth. When automation stands in for human review, the decision belongs to an agentic reviewer that inspects evidence, applies versioned criteria, and records a rationale. That acceptance redesign is tracked in RREG-WP-0014.
