| id | type | title | domain | repo | status | owner | topic_slug | created | updated | state_hub_workstream_id |
|---|---|---|---|---|---|---|---|---|---|---|
| RREG-WP-0013 | workplan | Self-Scoping Baseline Evaluation | capabilities | repo-scoping | done | codex | foerster-capabilities | 2026-05-15 | 2026-05-15 | 1c740db0-1999-478b-b3e3-c0fdfec1e9dd |
Self-Scoping Baseline Evaluation
repo-scoping should become self-improving infrastructure: every meaningful change to the scoping engine should be testable against a known baseline for repo-scoping itself. The goal is not just to assert that the output changed, but to make it easy for a human or trusted agent to decide whether the old or the new result is better, and to preserve that assessment as a signal for future engine iterations.
The motivating failure is the 2026-05-15 self-analysis where deterministic
provider-vocabulary facts were promoted into an approved Route LLM Requests Across Providers capability and the repo's native API/CLI features were
attached under that incorrect capability. Future reruns should make regressions
like that obvious, reviewable, and attributable to the exact repo-scoping
release that generated them.
T01: Define Self-Scoping Assessment Model
id: RREG-WP-0013-T01
status: done
priority: high
state_hub_task_id: "af633b76-3356-4480-8108-d996eeda5a31"
Define the data model for immutable self-scoping assessment runs.
Each assessment must bind together:
- The target repository identity: repo slug, source URL/path, target commit, target branch, and dirty-state marker when applicable.
- The engine identity: repo-scoping package version, git commit, git tag or release name when available, dirty-state marker, scanner version, candidate generator version, quality-gate/ruleset version, schema version, and prompt version/hash when LLM or agentic review is used.
- The execution mode: deterministic-only, LLM-assisted, agent-reviewed, trusted-auto-review, manual-review, or mixed.
- The generated artifacts: observed fact summary, candidate graph, approved map or proposed approval set, rejected/downgraded items, source refs, and review notes.
- The assessment outcome: baseline, challenger, preferred, tied, rejected, superseded, or needs-human.
Acceptance criteria:
- A documented schema exists for self-scoping assessment runs.
- Assessment runs are append-only; reruns create new records instead of rewriting old judgements.
- Engine release binding is required before an assessment can be compared.
- Dirty working trees are visible in the assessment metadata.
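The binding and append-only rules above can be sketched in a few lines. This is a minimal illustration only: the class names, the field subset, and the rule for what counts as a "bound" engine are assumptions, not the contents of the documented schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class EngineIdentity:
    # Hypothetical subset of the engine fields listed above.
    package_version: Optional[str] = None
    git_commit: Optional[str] = None
    schema_version: Optional[str] = None
    dirty: bool = False  # dirty working trees stay visible in the metadata

    def is_bound(self) -> bool:
        # Assumed rule: version, commit, and schema version must all be recorded.
        return all((self.package_version, self.git_commit, self.schema_version))


@dataclass(frozen=True)
class AssessmentRun:
    run_id: str
    repo_slug: str
    target_commit: str
    engine: EngineIdentity
    execution_mode: str = "deterministic-only"
    outcome: str = "challenger"

    def comparable(self) -> bool:
        # Engine release binding is required before a run can be compared.
        return self.engine.is_bound()


class AssessmentLog:
    """Append-only store: reruns add records, existing ones are never rewritten."""

    def __init__(self) -> None:
        self._runs: List[AssessmentRun] = []

    def append(self, run: AssessmentRun) -> None:
        if any(existing.run_id == run.run_id for existing in self._runs):
            raise ValueError(f"run {run.run_id!r} is already recorded")
        self._runs.append(run)

    def runs(self) -> Tuple[AssessmentRun, ...]:
        return tuple(self._runs)
```

Frozen dataclasses keep individual records immutable in memory, and the log rejects a repeated `run_id` instead of overwriting the earlier judgement.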
T02: Capture Current Bad Self-Run As A Regression Seed
id: RREG-WP-0013-T02
status: done
priority: high
state_hub_task_id: "98258aea-65bb-4709-921f-711c6cc6ee48"
Import or recreate the known-bad repo-scoping self-analysis as a named regression seed.
Known bad pattern:
- Candidate/approved capability: Route LLM Requests Across Providers.
- Incorrect feature attachment: repo-scoping API/CLI surfaces nested under that LLM-provider capability.
- Incorrect evidence: scanner vocabulary, schema examples, tests, and provider-name normalization code treated as repo-owned LLM routing behavior.
Acceptance criteria:
- The bad run can be inspected as a historical assessment artifact.
- It is clearly marked as a negative baseline, not a desired golden output.
- The failure explanation is stored next to the captured graph.
- Future comparison reports can flag when a challenger repeats the same pattern.
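As a rough sketch of how a comparison report might flag a repeat of this pattern, the function below scans a candidate graph for the forbidden capability and for API/CLI features nested under it. The graph shape (a list of dicts with `title`, `features`, and a `surface` tag) and the flag names are hypothetical, chosen only for illustration.

```python
FORBIDDEN_CAPABILITY = "Route LLM Requests Across Providers"


def find_known_bad_pattern(capabilities):
    """Return regression flags for the 2026-05-15 known-bad pattern.

    `capabilities` is assumed to look like:
    [{"title": str, "features": [{"title": str, "surface": str}]}]
    """
    flags = []
    for capability in capabilities:
        if capability["title"].strip().lower() != FORBIDDEN_CAPABILITY.lower():
            continue
        flags.append(
            {"pattern": "forbidden-native-capability", "capability": capability["title"]}
        )
        for feature in capability.get("features", []):
            # Native API/CLI surfaces must not hang under the provider capability.
            if feature.get("surface") in {"api", "cli"}:
                flags.append(
                    {"pattern": "misplaced-api-cli-feature", "feature": feature["title"]}
                )
    return flags
```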
T03: Create Desired Repo-Scoping Golden Profile
id: RREG-WP-0013-T03
status: done
priority: high
state_hub_task_id: "f3ef1711-a115-4368-a97e-98abd1eda521"
Author a curated golden profile for repo-scoping itself. This should be compact enough for comparison but expressive enough to catch hierarchy errors.
Expected native capabilities should cover at least:
- Repository registration and metadata import.
- Deterministic repository scanning into observed facts.
- Source-role and provenance-aware content indexing.
- Candidate characteristic generation from facts and content.
- Candidate review, edit, reject, merge, relink, and approval workflow.
- Approved characteristic search, comparison, export, and capability-gap exploration.
- SCOPE.md generation, diffing, validation, and write/update flows.
- Dependency graph and characteristic impact exploration.
- Scope context API support for downstream agents such as activity-core.
Forbidden top-level/native capabilities should include:
- Route LLM Requests Across Providers, unless repo-scoping later genuinely implements provider routing as a product feature rather than using llm-connect as optional extraction infrastructure.
Acceptance criteria:
- The golden profile includes ability, capability, feature, and evidence expectations with source paths.
- The profile distinguishes native utility from dependencies, fixtures, test vocabulary, schema examples, and optional LLM extraction infrastructure.
- The profile is stored in a stable, reviewable fixture location.
- The profile can evolve through explicit assessment decisions.
Implementation note 2026-05-15: added
docs/schemas/self-scoping-assessment.schema.json,
docs/self-scoping/assessments/repo-scoping-known-bad-2026-05-15-run-39.json,
docs/self-scoping/golden/repo-scoping-golden-profile.v1.json, and
tests/test_self_scoping_artifacts.py. The known-bad artifact is marked as a
negative regression seed with historical_incomplete release binding because
the original analysis run did not record the engine commit.
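A golden-profile check can be illustrated as a coverage calculation over expected and forbidden capability titles. The profile keys used here (`expected_capabilities`, `forbidden_capabilities`) are assumptions and may not match the actual fixture in docs/self-scoping/golden/. Exact title matching is also a simplification; a real check would likely need a more tolerant matcher.

```python
def normalize(title: str) -> str:
    # Case- and whitespace-insensitive key for matching capability titles.
    return " ".join(title.lower().split())


def check_against_profile(profile: dict, capability_titles: list) -> dict:
    found = {normalize(title) for title in capability_titles}
    expected = profile["expected_capabilities"]
    missing = [title for title in expected if normalize(title) not in found]
    forbidden = [
        title
        for title in profile["forbidden_capabilities"]
        if normalize(title) in found
    ]
    # Coverage is the share of expected capabilities present in the output.
    coverage = (len(expected) - len(missing)) / len(expected) if expected else 1.0
    return {"missing": missing, "forbidden_matches": forbidden, "coverage": coverage}
```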
T04: Export Assessment Artifacts From Analysis Runs
id: RREG-WP-0013-T04
status: done
priority: high
state_hub_task_id: "51e01d45-7574-4c97-994d-dabb2bcf9a00"
Add a CLI and/or API workflow that exports a completed analysis run as a self-scoping assessment artifact.
Acceptance criteria:
- Export includes repository metadata, analysis run metadata, engine identity, candidate graph, observed fact summary, content chunk summary, approved map if present, review decisions, and quality-gate outcomes when available.
- Export format is deterministic JSON with a documented schema.
- Export refuses to mark an artifact comparable when engine identity is incomplete.
- Export can target repo-scoping itself without requiring network access.
Implementation note 2026-05-15: added
src/repo_registry/self_scoping/assessment.py and the
repo-scoping export-assessment CLI command. The exporter reads an existing
completed analysis run, records engine identity, generated candidate tree,
approved map, fact/content summaries, review decisions, empty quality-gate
outcomes pending RREG-WP-0014, and known regression patterns. Focused tests cover
the exporter and CLI path.
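The two export rules that are easiest to get wrong are deterministic output and the refusal to mark incomplete artifacts comparable. The sketch below shows one way to satisfy both with the standard library; the function name, the required-field list, and the `comparable` and `missing_engine_fields` keys are hypothetical and are not taken from assessment.py.

```python
import json

# Assumed minimum for a complete engine identity.
REQUIRED_ENGINE_FIELDS = ("package_version", "git_commit", "schema_version")


def export_assessment(run: dict) -> str:
    engine = run.get("engine", {})
    missing = [name for name in REQUIRED_ENGINE_FIELDS if not engine.get(name)]
    artifact = dict(run)
    # Refuse to mark the artifact comparable when engine identity is incomplete.
    artifact["comparable"] = not missing
    artifact["missing_engine_fields"] = missing
    # sort_keys plus fixed indentation gives byte-identical output for equal input.
    return json.dumps(artifact, sort_keys=True, indent=2, ensure_ascii=False) + "\n"
```

Sorting keys means two exports of the same run diff cleanly, so a changed artifact reflects a changed result and not dictionary ordering.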
T05: Compare Baseline And Challenger Runs
id: RREG-WP-0013-T05
status: done
priority: high
state_hub_task_id: "2b71069b-6150-45f4-84a2-59f5ec1e04c0"
Implement comparison between an existing baseline and a later challenger run.
Comparison should report:
- Added, removed, renamed, and moved abilities/capabilities/features.
- Hierarchy quality changes, especially misplaced features under the wrong capability.
- Native-utility precision: whether generated capabilities are repo-owned, facade/adapter, dependency, tooling, fixture, schema-example, or mention-only.
- Coverage against the repo-scoping golden profile.
- Regression flags for known-bad patterns.
- Source-ref quality: whether claims cite product intent, docs, source, tests, fixtures, examples, or generated/derived scope.
Acceptance criteria:
- Comparison output is useful in both machine-readable JSON and human-readable Markdown.
- The report makes it easy to choose "old better", "new better", "tie", or "needs review".
- It does not require candidates to have stable database IDs across runs.
- It can compare deterministic-only and agent-reviewed runs without losing provenance.
Implementation note 2026-05-15: added
src/repo_registry/self_scoping/comparison.py and the
repo-scoping compare-assessment CLI command. The first comparison report
checks assessment artifacts against the repo-scoping golden profile and reports
missing expected capabilities, forbidden native capability matches, known
regression patterns, and misplaced API/CLI features under provider-routing
capabilities. Reports can be emitted as JSON or Markdown.
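Comparing runs without stable database IDs means matching items by content instead. One simple approach, sketched here under the assumption that each capability is a dict with a `title` and a list of feature titles, is to key features by normalized title and compare their parent capability. This is not the logic in comparison.py, and it cannot detect renames, which would need fuzzy matching or source-ref overlap.

```python
def _norm(title: str) -> str:
    return " ".join(title.lower().split())


def index_features(capabilities):
    """Map normalized feature title -> normalized parent capability title."""
    index = {}
    for capability in capabilities:
        for feature in capability.get("features", []):
            index[_norm(feature)] = _norm(capability["title"])
    return index


def compare(baseline, challenger):
    old, new = index_features(baseline), index_features(challenger)
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        # A feature is "moved" when it sits under a different capability.
        "moved": [
            {"feature": key, "from": old[key], "to": new[key]}
            for key in sorted(shared)
            if old[key] != new[key]
        ],
    }
```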
T06: Add Side-By-Side Review UI
id: RREG-WP-0013-T06
status: done
priority: medium
state_hub_task_id: "16a60b7c-7e2c-4bb0-b4ab-2381289dba0b"
Expose baseline/challenger comparison in the curator UI.
Acceptance criteria:
- Reviewers can select two assessment artifacts for repo-scoping.
- The UI shows the two hierarchy trees side by side with moved/misplaced items highlighted.
- Reviewers can record preference, tie, rejection, and notes.
- Review decisions are persisted as assessment outcomes, not as changes to the underlying historical artifacts.
Implementation note 2026-05-15: added a file-backed /ui/self-scoping curator
surface that reads golden profiles and assessment artifacts from
docs/self-scoping, renders side-by-side hierarchy comparisons with regression
highlights, compares two assessment runs directly for old-vs-new judgement, and
records append-only review outcome JSON under docs/self-scoping/outcomes/.
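The append-only outcome rule can be enforced at the filesystem level. The following is a minimal sketch, not the curator UI's actual code: the function name, the record fields, and the one-file-per-outcome layout are assumptions. It relies on Python's exclusive-create mode, which fails if the target file already exists.

```python
import json
from pathlib import Path

# Decisions a reviewer can record; names follow the outcome list in T01.
VALID_DECISIONS = {"preferred", "tied", "rejected", "needs-human"}


def record_outcome(outcomes_dir: Path, outcome_id: str, baseline_id: str,
                   challenger_id: str, decision: str, notes: str = "") -> Path:
    if decision not in VALID_DECISIONS:
        raise ValueError(f"unknown decision: {decision!r}")
    outcomes_dir.mkdir(parents=True, exist_ok=True)
    path = outcomes_dir / f"{outcome_id}.json"
    record = {
        "baseline": baseline_id,
        "challenger": challenger_id,
        "decision": decision,
        "notes": notes,
    }
    # Mode "x" raises FileExistsError, so an existing outcome is never overwritten.
    with path.open("x", encoding="utf-8") as handle:
        json.dump(record, handle, sort_keys=True, indent=2)
    return path
```

The outcome file only references the two assessment artifacts by ID, so the historical artifacts themselves are left untouched.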
T07: Add Self-Scoping Regression Command
id: RREG-WP-0013-T07
status: done
priority: medium
state_hub_task_id: "af1fcecd-686d-4592-b739-4698abc98c55"
Add a repeatable command for running repo-scoping against itself and comparing the result to the active baseline.
Acceptance criteria:
- The command captures engine identity before running analysis.
- The command can run deterministic-only without LLM or agentic review.
- The command can optionally invoke agentic review when configured.
- The command emits a comparison report and exits non-zero only for explicit CI-blocking regressions, not for ordinary "needs review" assessment outcomes.
Implementation note 2026-05-15: added repo-scoping self-assess. The command
analyzes a source tree, exports a challenger assessment artifact, compares it to
the golden profile, emits JSON or Markdown, and returns non-zero only with
--fail-on-regression when the comparison status is regression. The command
defaults to deterministic-only; --with-llm opts into configured LLM assistance.
--agentic-review now records an agentic-review request and leaves candidates
pending when no agentic reviewer is configured.
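The exit-code rule described above is small enough to show directly. In this sketch the three flags come from the note above, but the parser wiring and every status value other than `regression` are assumptions, not the real command's implementation.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="repo-scoping self-assess")
    parser.add_argument("--fail-on-regression", action="store_true")
    parser.add_argument("--with-llm", action="store_true")
    parser.add_argument("--agentic-review", action="store_true")
    return parser


def exit_code(status: str, fail_on_regression: bool) -> int:
    # Only an explicit regression blocks CI, and only when the flag is set;
    # other statuses (for example a hypothetical "needs-review") exit zero.
    return 1 if fail_on_regression and status == "regression" else 0
```

Keeping the default exit code at zero lets the command run in CI as a reporting step before a team opts into blocking behaviour.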
T08: Document Assessment Workflow
id: RREG-WP-0013-T08
status: done
priority: medium
state_hub_task_id: "30d71946-3598-4dc7-9970-c7c18126cad7"
Document how maintainers should use self-scoping assessment artifacts while evolving the engine.
Acceptance criteria:
- Documentation explains baseline, challenger, preferred, tied, rejected, and superseded outcomes.
- Documentation explains engine release binding and why unbound output is not comparable.
- Documentation gives examples for the known-bad LLM-provider regression and a desired native repo-scoping profile.
- Documentation describes when to update the golden profile versus when to fix the engine.
Implementation note 2026-05-15: added docs/self-scoping/workflow.md. The
workflow documents assessment outcomes, release binding, the standard
self-assessment loop, CI use, when to update the golden profile, when to fix the
engine, and the relationship to RREG-WP-0014 agentic acceptance.
Completion Criteria
- repo-scoping has an immutable, release-bound self-scoping assessment format.
- The current known-bad output is captured as a negative regression seed.
- A curated desired repo-scoping profile exists.
- Maintainers can rerun repo-scoping on itself, compare old/new results, and record which output is better.
- Comparison results are bound to the repo-scoping release that generated them.