tegwick 8f484cd855 Route auto review requests to agentic review

2026-05-15 15:53:52 +02:00

id: RREG-WP-0013
type: workplan
title: Self-Scoping Baseline Evaluation
domain: capabilities
repo: repo-scoping
status: done
owner: codex
topic_slug: foerster-capabilities
created: 2026-05-15
updated: 2026-05-15
state_hub_workstream_id: 1c740db0-1999-478b-b3e3-c0fdfec1e9dd

Self-Scoping Baseline Evaluation

repo-scoping should become self-improving infrastructure: every meaningful change to the scoping engine should be testable against a known baseline for repo-scoping itself. The goal is not only to detect that output changed, but to make it easy for a human or trusted agent to decide whether the old or the new result is better, and to preserve that assessment as signal for future engine iterations.

The motivating failure is the 2026-05-15 self-analysis, in which deterministic provider-vocabulary facts were promoted into an approved Route LLM Requests Across Providers capability and the repo's native API/CLI features were attached under that incorrect capability. Future reruns should make regressions like that obvious, reviewable, and attributable to the exact repo-scoping release that generated them.

T01: Define Self-Scoping Assessment Model

id: RREG-WP-0013-T01
status: done
priority: high
state_hub_task_id: "af633b76-3356-4480-8108-d996eeda5a31"

Define the data model for immutable self-scoping assessment runs.

Each assessment must bind together:

The target repository identity: repo slug, source URL/path, target commit, target branch, and dirty-state marker when applicable.
The engine identity: repo-scoping package version, git commit, git tag or release name when available, dirty-state marker, scanner version, candidate generator version, quality-gate/ruleset version, schema version, and prompt version/hash when LLM or agentic review is used.
The execution mode: deterministic-only, LLM-assisted, agent-reviewed, trusted-auto-review, manual-review, or mixed.
The generated artifacts: observed fact summary, candidate graph, approved map or proposed approval set, rejected/downgraded items, source refs, and review notes.
The assessment outcome: baseline, challenger, preferred, tied, rejected, superseded, or needs-human.

Acceptance criteria:

A documented schema exists for self-scoping assessment runs.
Assessment runs are append-only; reruns create new records instead of rewriting old judgements.
Engine release binding is required before an assessment can be compared.
Dirty working trees are visible in the assessment metadata.
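
As an illustration of the bindings listed above, the following is a minimal sketch of an immutable assessment record. All class and field names are hypothetical and cover only a subset of the listed identity fields; the actual schema lives in docs/schemas/self-scoping-assessment.schema.json and may differ. The rule that package version, git commit, and schema version form the minimum release binding is an assumption made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ExecutionMode(str, Enum):
    DETERMINISTIC_ONLY = "deterministic-only"
    LLM_ASSISTED = "llm-assisted"
    AGENT_REVIEWED = "agent-reviewed"
    TRUSTED_AUTO_REVIEW = "trusted-auto-review"
    MANUAL_REVIEW = "manual-review"
    MIXED = "mixed"


class Outcome(str, Enum):
    BASELINE = "baseline"
    CHALLENGER = "challenger"
    PREFERRED = "preferred"
    TIED = "tied"
    REJECTED = "rejected"
    SUPERSEDED = "superseded"
    NEEDS_HUMAN = "needs-human"


@dataclass(frozen=True)
class TargetIdentity:
    repo_slug: str
    source: str          # source URL or local path
    commit: str
    branch: str
    dirty: bool = False  # dirty-state marker for the target working tree


@dataclass(frozen=True)
class EngineIdentity:
    package_version: Optional[str] = None
    git_commit: Optional[str] = None
    release: Optional[str] = None        # git tag or release name, when available
    dirty: bool = False                  # dirty-state marker for the engine checkout
    schema_version: Optional[str] = None
    prompt_hash: Optional[str] = None    # only set when LLM or agentic review is used

    def is_bound(self) -> bool:
        # Assumption: these three fields are the minimum release binding.
        return all((self.package_version, self.git_commit, self.schema_version))


@dataclass(frozen=True)
class AssessmentRun:
    run_id: str
    target: TargetIdentity
    engine: EngineIdentity
    mode: ExecutionMode
    outcome: Outcome

    @property
    def comparable(self) -> bool:
        # An assessment without a complete engine binding must not be compared.
        return self.engine.is_bound()
```

Freezing the dataclasses mirrors the append-only requirement: a rerun has to construct a new record rather than mutate an earlier judgement.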

T02: Capture Current Bad Self-Run As A Regression Seed

id: RREG-WP-0013-T02
status: done
priority: high
state_hub_task_id: "98258aea-65bb-4709-921f-711c6cc6ee48"

Import or recreate the known-bad repo-scoping self-analysis as a named regression seed.

Known bad pattern:

Candidate/approved capability: Route LLM Requests Across Providers.
Incorrect feature attachment: repo-scoping API/CLI surfaces nested under that LLM-provider capability.
Incorrect evidence: scanner vocabulary, schema examples, tests, and provider-name normalization code treated as repo-owned LLM routing behavior.

Acceptance criteria:

The bad run can be inspected as a historical assessment artifact.
It is clearly marked as a negative baseline, not a desired golden output.
The failure explanation is stored next to the captured graph.
Future comparison reports can flag when a challenger repeats the same pattern.
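
The known-bad pattern above could be detected in a challenger roughly as follows. This is a sketch under assumptions: the graph layout (`capabilities`, `features`, `title`, `kind`) and the `api`/`cli` kind labels are hypothetical and are not taken from the stored artifact.

```python
KNOWN_BAD_CAPABILITY = "Route LLM Requests Across Providers"
NATIVE_SURFACE_KINDS = {"api", "cli"}  # assumed labels for repo-scoping's own surfaces


def repeats_known_bad_pattern(candidate_graph: dict) -> list[str]:
    """Return human-readable flags when a graph repeats the negative seed."""
    flags = []
    for capability in candidate_graph.get("capabilities", []):
        title = capability.get("title", "")
        if title.strip().lower() != KNOWN_BAD_CAPABILITY.lower():
            continue
        flags.append(f"forbidden capability present: {title}")
        # Native API/CLI features nested under the provider capability
        # are the second half of the known-bad pattern.
        for feature in capability.get("features", []):
            if feature.get("kind") in NATIVE_SURFACE_KINDS:
                flags.append(
                    f"native {feature['kind']} feature misplaced: {feature.get('title', '')}"
                )
    return flags
```

An empty list means the challenger does not repeat this particular pattern; it says nothing about other regressions.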

T03: Create Desired Repo-Scoping Golden Profile

id: RREG-WP-0013-T03
status: done
priority: high
state_hub_task_id: "f3ef1711-a115-4368-a97e-98abd1eda521"

Author a curated golden profile for repo-scoping itself. This should be compact enough for comparison but expressive enough to catch hierarchy errors.

Expected native capabilities should cover at least:

Repository registration and metadata import.
Deterministic repository scanning into observed facts.
Source-role and provenance-aware content indexing.
Candidate characteristic generation from facts and content.
Candidate review, edit, reject, merge, relink, and approval workflow.
Approved characteristic search, comparison, export, and capability-gap exploration.
SCOPE.md generation, diffing, validation, and write/update flows.
Dependency graph and characteristic impact exploration.
Scope context API support for downstream agents such as activity-core.

Forbidden top-level/native capabilities should include:

Route LLM Requests Across Providers, unless repo-scoping later genuinely implements provider routing as a product feature rather than using llm-connect as optional extraction infrastructure.

Acceptance criteria:

The golden profile includes ability, capability, feature, and evidence expectations with source paths.
The profile distinguishes native utility from dependencies, fixtures, test vocabulary, schema examples, and optional LLM extraction infrastructure.
The profile is stored in a stable, reviewable fixture location.
The profile can evolve through explicit assessment decisions.
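
To make the expected/forbidden split concrete, here is a small sketch of a profile check. The dictionary keys and the exact-title matching are assumptions for illustration; the real profile in docs/self-scoping/golden/repo-scoping-golden-profile.v1.json also carries ability, feature, and evidence expectations that this sketch omits.

```python
GOLDEN_PROFILE = {
    # A subset of the expected native capabilities listed above.
    "expected_capabilities": [
        "Repository registration and metadata import",
        "Deterministic repository scanning into observed facts",
        "Dependency graph and characteristic impact exploration",
    ],
    "forbidden_native_capabilities": [
        "Route LLM Requests Across Providers",
    ],
}


def normalize(title: str) -> str:
    """Case-fold and collapse whitespace so trivial differences do not matter."""
    return " ".join(title.lower().split())


def check_against_profile(profile: dict, generated_titles: list[str]) -> dict:
    """Report expected capabilities that are missing and forbidden ones that appear."""
    generated = {normalize(title) for title in generated_titles}
    missing = [
        title for title in profile["expected_capabilities"]
        if normalize(title) not in generated
    ]
    forbidden = [
        title for title in profile["forbidden_native_capabilities"]
        if normalize(title) in generated
    ]
    return {"missing_expected": missing, "forbidden_matches": forbidden}
```

Exact title matching is the simplest possible rule and would likely be too strict in practice, since generated capability names vary between runs.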

Implementation note 2026-05-15: added docs/schemas/self-scoping-assessment.schema.json, docs/self-scoping/assessments/repo-scoping-known-bad-2026-05-15-run-39.json, docs/self-scoping/golden/repo-scoping-golden-profile.v1.json, and tests/test_self_scoping_artifacts.py. The known-bad artifact is marked as a negative regression seed with historical_incomplete release binding because the original analysis run did not record the engine commit.

T04: Export Assessment Artifacts From Analysis Runs

id: RREG-WP-0013-T04
status: done
priority: high
state_hub_task_id: "51e01d45-7574-4c97-994d-dabb2bcf9a00"

Add a CLI and/or API workflow that exports a completed analysis run as a self-scoping assessment artifact.

Acceptance criteria:

Export includes repository metadata, analysis run metadata, engine identity, candidate graph, observed fact summary, content chunk summary, approved map if present, review decisions, and quality-gate outcomes when available.
Export format is deterministic JSON with a documented schema.
Export refuses to mark an artifact comparable when engine identity is incomplete.
Export can target repo-scoping itself without requiring network access.
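
The deterministic-JSON and comparability criteria can be sketched as below. The function name, the `engine` key, and the set of required engine fields are hypothetical; this is not the exporter in src/repo_registry/self_scoping/assessment.py.

```python
import json

# Assumption: these fields make up a complete engine identity.
REQUIRED_ENGINE_FIELDS = ("package_version", "git_commit", "schema_version")


def export_assessment(run: dict) -> str:
    """Serialize a run so identical content always yields identical bytes."""
    engine = run.get("engine", {})
    missing = [name for name in REQUIRED_ENGINE_FIELDS if not engine.get(name)]
    artifact = dict(run)
    # Never mark an artifact comparable when the engine identity is incomplete.
    artifact["comparable"] = not missing
    artifact["missing_engine_fields"] = missing
    # sort_keys removes dependence on dict insertion order.
    return json.dumps(artifact, sort_keys=True, indent=2, ensure_ascii=False) + "\n"
```

Sorting keys covers object ordering only; lists such as candidates or source refs would also need a stable order before serialization for the output to be fully deterministic.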

Implementation note 2026-05-15: added src/repo_registry/self_scoping/assessment.py and the repo-scoping export-assessment CLI command. The exporter reads an existing completed analysis run, records engine identity, generated candidate tree, approved map, fact/content summaries, review decisions, empty quality-gate outcomes pending RREG-WP-0014, and known regression patterns. Focused tests cover the exporter and CLI path.

T05: Compare Baseline And Challenger Runs

id: RREG-WP-0013-T05
status: done
priority: high
state_hub_task_id: "2b71069b-6150-45f4-84a2-59f5ec1e04c0"

Implement comparison between an existing baseline and a later challenger run.

Comparison should report:

Added, removed, renamed, and moved abilities/capabilities/features.
Hierarchy quality changes, especially misplaced features under the wrong capability.
Native-utility precision: whether generated capabilities are repo-owned, facade/adapter, dependency, tooling, fixture, schema-example, or mention-only.
Coverage against the repo-scoping golden profile.
Regression flags for known-bad patterns.
Source-ref quality: whether claims cite product intent, docs, source, tests, fixtures, examples, or generated/derived scope.

Acceptance criteria:

Comparison output is useful in both machine-readable JSON and human-readable Markdown.
The report makes it easy to choose "old better", "new better", "tie", or "needs review".
It does not require candidates to have stable database IDs across runs.
It can compare deterministic-only and agent-reviewed runs without losing provenance.
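
One way to compare runs without stable database IDs is to key each feature by its normalized title and compare parents. The sketch below assumes a hypothetical graph layout (`capabilities` containing `features`, each with a `title`) and only handles added, removed, and moved features; renames would need fuzzier matching.

```python
def _norm(title: str) -> str:
    return " ".join(title.lower().split())


def feature_parents(graph: dict) -> dict[str, str]:
    """Map each normalized feature title to its normalized parent capability."""
    parents = {}
    for capability in graph.get("capabilities", []):
        for feature in capability.get("features", []):
            parents[_norm(feature["title"])] = _norm(capability["title"])
    return parents


def compare_graphs(baseline: dict, challenger: dict) -> dict:
    """Diff two graphs by title instead of by run-specific database IDs."""
    old = feature_parents(baseline)
    new = feature_parents(challenger)
    shared = old.keys() & new.keys()
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        # Same feature, different parent capability: a hierarchy change.
        "moved": sorted(title for title in shared if old[title] != new[title]),
    }
```

A "moved" entry is not automatically good or bad; it is the item a reviewer should look at when deciding between "old better" and "new better".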

Implementation note 2026-05-15: added src/repo_registry/self_scoping/comparison.py and the repo-scoping compare-assessment CLI command. The first comparison report checks assessment artifacts against the repo-scoping golden profile, reports missing expected capabilities, forbidden native capability matches, known regression patterns, and misplaced API/CLI features under provider-routing capabilities. Reports can be emitted as JSON or Markdown.

T06: Add Side-By-Side Review UI

id: RREG-WP-0013-T06
status: done
priority: medium
state_hub_task_id: "16a60b7c-7e2c-4bb0-b4ab-2381289dba0b"

Expose baseline/challenger comparison in the curator UI.

Acceptance criteria:

Reviewers can select two assessment artifacts for repo-scoping.
The UI shows the two hierarchy trees side by side with moved/misplaced items highlighted.
Reviewers can record preference, tie, rejection, and notes.
Review decisions are persisted as assessment outcomes, not as changes to the underlying historical artifacts.

Implementation note 2026-05-15: added a file-backed /ui/self-scoping curator surface that reads golden profiles and assessment artifacts from docs/self-scoping, renders side-by-side hierarchy comparisons with regression highlights, compares two assessment runs directly for old-vs-new judgement, and records append-only review outcome JSON under docs/self-scoping/outcomes/.
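
A minimal sketch of append-only outcome recording, assuming one JSON file per decision. The file naming scheme, record fields, and decision labels are hypothetical and are not taken from the /ui/self-scoping implementation.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Assumed labels, based on the choices named in this workplan.
DECISIONS = {"old better", "new better", "tie", "needs review", "rejected"}


def record_outcome(outcomes_dir: Path, baseline_id: str, challenger_id: str,
                   decision: str, notes: str = "") -> Path:
    """Write a new outcome file; existing outcomes and artifacts stay untouched."""
    if decision not in DECISIONS:
        raise ValueError(f"unknown decision: {decision!r}")
    outcomes_dir.mkdir(parents=True, exist_ok=True)
    sequence = len(list(outcomes_dir.glob("*.json"))) + 1
    path = outcomes_dir / f"{sequence:04d}-{baseline_id}-vs-{challenger_id}.json"
    record = {
        "baseline": baseline_id,
        "challenger": challenger_id,
        "decision": decision,
        "notes": notes,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    # Mode "x" fails with FileExistsError instead of overwriting a prior outcome.
    with path.open("x", encoding="utf-8") as handle:
        json.dump(record, handle, sort_keys=True, indent=2)
    return path
```

Changing one's mind then means adding a later record rather than editing an earlier one, so the history of judgements stays inspectable.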

T07: Add Self-Scoping Regression Command

id: RREG-WP-0013-T07
status: done
priority: medium
state_hub_task_id: "af1fcecd-686d-4592-b739-4698abc98c55"

Add a repeatable command for running repo-scoping against itself and comparing the result to the active baseline.

Acceptance criteria:

The command captures engine identity before running analysis.
The command can run deterministic-only without LLM or agentic review.
The command can optionally invoke agentic review when configured.
The command emits a comparison report and exits non-zero only for explicit CI-blocking regressions, not for ordinary "needs review" assessment outcomes.

Implementation note 2026-05-15: added repo-scoping self-assess. The command analyzes a source tree, exports a challenger assessment artifact, compares it to the golden profile, emits JSON or Markdown, and returns non-zero only with --fail-on-regression when the comparison status is regression. The command defaults to deterministic-only; --with-llm opts into configured LLM assistance. --agentic-review now records an agentic-review request and leaves candidates pending when no agentic reviewer is configured.
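
The exit-code rule described above reduces to a small decision function. This is an illustrative sketch, not the command's actual code: the function name is hypothetical, and only the `regression` status is taken from the note; the other status strings in the usage lines are placeholders.

```python
def exit_code_for(status: str, fail_on_regression: bool) -> int:
    """Return a process exit code for a comparison status.

    Only an explicit regression blocks CI, and only when the caller opted in
    (the role played by --fail-on-regression).
    """
    if status == "regression" and fail_on_regression:
        return 1
    return 0


# Placeholder statuses other than "regression" never fail the run.
print(exit_code_for("regression", fail_on_regression=True))     # 1
print(exit_code_for("regression", fail_on_regression=False))    # 0
print(exit_code_for("needs-review", fail_on_regression=True))   # 0
```

Keeping ordinary review outcomes at exit code 0 lets the command run in CI on every change without turning open judgement calls into build failures.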

T08: Document Assessment Workflow

id: RREG-WP-0013-T08
status: done
priority: medium
state_hub_task_id: "30d71946-3598-4dc7-9970-c7c18126cad7"

Document how maintainers should use self-scoping assessment artifacts while evolving the engine.

Acceptance criteria:

Documentation explains baseline, challenger, preferred, tied, rejected, and superseded outcomes.
Documentation explains engine release binding and why unbound output is not comparable.
Documentation gives examples for the known-bad LLM-provider regression and a desired native repo-scoping profile.
Documentation describes when to update the golden profile versus when to fix the engine.

Implementation note 2026-05-15: added docs/self-scoping/workflow.md. The workflow documents assessment outcomes, release binding, the standard self-assessment loop, CI use, when to update the golden profile, when to fix the engine, and the relationship to RREG-WP-0014 agentic acceptance.

Completion Criteria

repo-scoping has an immutable, release-bound self-scoping assessment format.
The current known-bad output is captured as a negative regression seed.
A curated desired repo-scoping profile exists.
Maintainers can rerun repo-scoping on itself, compare old/new results, and record which output is better.
Comparison results are bound to the repo-scoping release that generated them.
