# Self-Scoping Assessment Workflow

Self-scoping is the feedback loop for improving repo-scoping with evidence. The
loop is simple: run the current engine against repo-scoping itself, compare the
result to a curated golden profile and known bad runs, then record whether the
new result is better.

## Outcome Terms

- `baseline`: a result accepted as a reference point for later comparisons.
- `challenger`: a fresh result from a new engine version or configuration.
- `preferred`: the reviewer chose this result over the prior baseline.
- `tied`: the reviewer judged old and new results roughly equivalent.
- `rejected`: the result is known bad and should not become baseline truth.
- `superseded`: the result used to be useful but was replaced by a newer
  preferred assessment.
- `needs-human`: the result cannot be judged confidently without curator
  review.

The known 2026-05-15 run 39 artifact is a `rejected` negative regression seed,
not a baseline to imitate.

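The outcome vocabulary above can be modeled as a closed set so that typos fail loudly. The following is a minimal sketch, not the project's actual data model: the `Outcome` class, `parse_outcome`, `may_become_baseline`, and the `PROMOTABLE` set are all hypothetical names. Only the rule that `rejected` must never become baseline truth comes from the text; treating `baseline` and `preferred` as the only promotable outcomes is an assumption.

```python
from enum import Enum


class Outcome(str, Enum):
    """Outcome terms; values match the spellings used in this document."""

    BASELINE = "baseline"
    CHALLENGER = "challenger"
    PREFERRED = "preferred"
    TIED = "tied"
    REJECTED = "rejected"
    SUPERSEDED = "superseded"
    NEEDS_HUMAN = "needs-human"


# Assumption: only these outcomes may be promoted to the active baseline.
# The document states outright that `rejected` must not; the rest is inferred.
PROMOTABLE = frozenset({Outcome.BASELINE, Outcome.PREFERRED})


def parse_outcome(label: str) -> Outcome:
    """Parse a label, raising ValueError for anything outside the vocabulary."""
    return Outcome(label.strip().lower())


def may_become_baseline(outcome: Outcome) -> bool:
    """Return True only for outcomes in the (assumed) promotable set."""
    return outcome in PROMOTABLE
```

Under these assumptions, `may_become_baseline(parse_outcome("rejected"))` is `False`, so the run 39 artifact can never be promoted by accident.
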
## Release Binding

Assessment output is only useful if it is bound to the engine that generated it.
Comparable challenger artifacts should record:

- repo-scoping package version
- engine git commit
- engine release or tag when available
- engine dirty state
- scanner version
- candidate generator version
- quality criteria version
- prompt version when LLM or agentic review is used

An artifact with `release_binding_status=complete` can be compared as a real
challenger. An artifact marked `historical_incomplete` can still be useful as a
negative seed, but it should not become a preferred baseline. An `unbound`
artifact is diagnostic only.

Dirty state does not automatically make an artifact useless, but it must be
visible. A dirty challenger should usually be rerun after the relevant changes
are committed.

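The binding fields and status rules above can be sketched as a small record plus a lookup. This is an illustration under stated assumptions, not the artifact schema: the `ReleaseBinding` field names, the role labels (`challenger`, `negative-seed`, `diagnostic`), and both helper functions are hypothetical. How the engine derives `release_binding_status` is not described here, so the sketch takes the status as given.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReleaseBinding:
    """Hypothetical record of the binding fields; names are illustrative."""

    package_version: str
    engine_git_commit: str
    engine_dirty: bool
    scanner_version: str
    candidate_generator_version: str
    quality_criteria_version: str
    engine_release: Optional[str] = None  # release or tag, when available
    prompt_version: Optional[str] = None  # only when LLM or agentic review is used


def permitted_role(status: str) -> str:
    """Map a release_binding_status value to how the artifact may be used."""
    roles = {
        "complete": "challenger",
        "historical_incomplete": "negative-seed",
        "unbound": "diagnostic",
    }
    if status not in roles:
        raise ValueError(f"unknown release_binding_status: {status!r}")
    return roles[status]


def should_rerun(binding: ReleaseBinding) -> bool:
    """A dirty challenger should usually be rerun once changes are committed."""
    return binding.engine_dirty
```

Keeping `engine_dirty` as a required field, rather than an optional one, is one way to make dirty state always visible in the record.
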
## Standard Loop

1. Run the self-assessment command:

   ```bash
   repo-scoping self-assess \
     --source-path . \
     --assessment-output docs/self-scoping/assessments/repo-scoping-challenger.json \
     --comparison-output docs/self-scoping/assessments/repo-scoping-challenger.md
   ```

2. Read the comparison report.

3. Open the curator UI at `/ui/self-scoping` to compare the golden profile and
   assessment artifact side by side.

4. When an earlier baseline assessment exists, use the same page's two-run
   comparison to judge old output against the new challenger.

5. If the report says `regression`, inspect forbidden capabilities, misplaced
   features, and known regression patterns first.

6. If the report says `needs_review`, inspect missing expected capabilities and
   source evidence before choosing old or new output.

7. If the report says `candidate_improvement`, still confirm that the
   hierarchy, source refs, and native-utility boundaries make sense.

8. Record the decision as an assessment outcome before changing the active
   baseline. The UI writes append-only outcome records under
   `docs/self-scoping/outcomes/`; it does not rewrite historical assessment or
   golden-profile artifacts.

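Steps 5 through 7 pair each report verdict with the things to inspect first. That pairing can be written down as a lookup table, sketched below. The verdict strings and checklist items come from the steps above; the `FIRST_CHECKS` table and `first_checks` function are hypothetical names, and raising on an unknown verdict is a design assumption rather than documented behavior.

```python
from typing import Dict, List

# What a reviewer inspects first for each comparison-report verdict.
FIRST_CHECKS: Dict[str, List[str]] = {
    "regression": [
        "forbidden capabilities",
        "misplaced features",
        "known regression patterns",
    ],
    "needs_review": [
        "missing expected capabilities",
        "source evidence",
    ],
    "candidate_improvement": [
        "hierarchy",
        "source refs",
        "native-utility boundaries",
    ],
}


def first_checks(verdict: str) -> List[str]:
    """Return a copy of the checklist for a verdict; reject unknown verdicts."""
    if verdict not in FIRST_CHECKS:
        raise ValueError(f"unknown verdict: {verdict!r}")
    return list(FIRST_CHECKS[verdict])
```

Note that `candidate_improvement` still has a checklist: a favorable verdict narrows the review but does not replace it.
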
## CI Use

Use `--fail-on-regression` only when regressions should block the command:

```bash
repo-scoping self-assess \
  --source-path . \
  --comparison-output /tmp/repo-scoping-self-assessment.md \
  --fail-on-regression
```

The command should not fail for ordinary `needs_review` results. Review-needed
output is signal, not a broken build.

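The blocking rule described above reduces to a two-input decision. The sketch below illustrates that policy only; it is not the command's implementation. The `exit_code` function is a hypothetical name, and the choice of `1` for a blocking failure and `0` otherwise is an assumption, since the command's real exit codes are not documented here.

```python
def exit_code(verdict: str, fail_on_regression: bool) -> int:
    """Return a process exit code for a comparison verdict.

    Assumption: 0 means success and 1 means a blocking failure.
    """
    # Only a regression can block, and only when the caller opted in.
    if verdict == "regression" and fail_on_regression:
        return 1
    # needs_review and candidate_improvement never fail the command.
    return 0
```
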
## Updating The Golden Profile

Update `golden/repo-scoping-golden-profile.v1.json` when the repository's real
product utility has changed. Examples:

- repo-scoping adds a genuinely new user-facing capability.
- a capability is renamed after curator agreement.
- a former out-of-scope behavior becomes product intent and has supporting
  implementation evidence.

Do not update the golden profile just because the engine failed to find an
expected capability. That is usually an engine issue.

## Fixing The Engine

Fix the engine when a challenger:

- repeats a known regression pattern
- promotes dependency, fixture, schema, scanner-rule, or workplan vocabulary as
  native capability truth
- places features under a capability they do not support
- loses source refs or cites evidence that does not support the abstraction
- relies on generated `SCOPE.md` as primary proof for rebuilding the same model

The 2026-05-15 run 39 failure is the canonical example: provider vocabulary from
scanner code, tests, fixtures, and schema examples became the false native
capability `Route LLM Requests Across Providers`. The correct action is to fix
scanner/generator/acceptance behavior, not to teach the golden profile that
repo-scoping is an LLM router.

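The run 39 failure suggests one kind of guard: a capability whose evidence comes only from tests, fixtures, schema examples, or scanner code should not be accepted as native. The sketch below is a deliberately crude illustration of that idea, not the engine's acceptance logic. The directory names in `NON_NATIVE_PARTS`, the example paths, and both function names are hypothetical; a real check would need finer rules than matching path components.

```python
from pathlib import PurePosixPath
from typing import List

# Hypothetical directory names; a real repository layout will differ.
NON_NATIVE_PARTS = frozenset({"tests", "fixtures", "schemas", "scanner"})


def is_non_native(source_ref: str) -> bool:
    """True when any path component marks the ref as non-native material."""
    return any(part in NON_NATIVE_PARTS for part in PurePosixPath(source_ref).parts)


def has_native_support(source_refs: List[str]) -> bool:
    """A capability needs at least one ref outside the non-native directories."""
    return any(not is_non_native(ref) for ref in source_refs)
```

With these assumptions, `has_native_support(["tests/test_providers.py", "fixtures/providers.json"])` is `False`, which is the shape of evidence that produced `Route LLM Requests Across Providers`.
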
## Relationship To Agentic Acceptance

Deterministic assessment can reject, downgrade, or flag output with transparent
criteria. It should not approve candidate characteristics as registry truth.
When automation stands in for human review, the decision belongs to an agentic
reviewer that inspects evidence, applies versioned criteria, and records a
rationale. That acceptance redesign is tracked in `RREG-WP-0014`.