Add self-scoping baseline workplans and artifacts

2026-05-15 12:26:36 +02:00
parent a6e1e2f16a
commit 90bae27237
7 changed files with 1592 additions and 0 deletions
--- a/workplans/RREG-WP-0013-self-scoping-baseline-evaluation.md
+++ b/workplans/RREG-WP-0013-self-scoping-baseline-evaluation.md
@@ -0,0 +1,258 @@
+---
+id: RREG-WP-0013
+type: workplan
+title: "Self-Scoping Baseline Evaluation"
+domain: capabilities
+repo: repo-scoping
+status: active
+owner: codex
+topic_slug: foerster-capabilities
+created: "2026-05-15"
+updated: "2026-05-15"
+state_hub_workstream_id: "1c740db0-1999-478b-b3e3-c0fdfec1e9dd"
+---
+
+# Self-Scoping Baseline Evaluation
+
+repo-scoping should become self-improving infrastructure: every meaningful
+change to the scoping engine should be testable against a known baseline for
+repo-scoping itself. The goal is not just to assert that output changed, but to
+make it easy for a human or trusted agent to decide whether an old or new
+result is better and preserve that assessment as signal for future engine
+iterations.
+
+The motivating failure is the 2026-05-15 self-analysis in which deterministic
+provider-vocabulary facts were promoted into an approved `Route LLM Requests
+Across Providers` capability and the repo's native API/CLI features were
+attached under that incorrect capability. Future reruns should make regressions
+like that obvious, reviewable, and attributable to the exact repo-scoping
+release that generated them.
+
+## T01: Define Self-Scoping Assessment Model
+
+```task
+id: RREG-WP-0013-T01
+status: done
+priority: high
+state_hub_task_id: "af633b76-3356-4480-8108-d996eeda5a31"
+```
+
+Define the data model for immutable self-scoping assessment runs.
+
+Each assessment must bind together:
+
+- The target repository identity: repo slug, source URL/path, target commit,
+  target branch, and dirty-state marker when applicable.
+- The engine identity: repo-scoping package version, git commit, git tag or
+  release name when available, dirty-state marker, scanner version, candidate
+  generator version, quality-gate/ruleset version, schema version, and prompt
+  version/hash when LLM or agentic review is used.
+- The execution mode: deterministic-only, LLM-assisted, agent-reviewed,
+  trusted-auto-review, manual-review, or mixed.
+- The generated artifacts: observed fact summary, candidate graph, approved map
+  or proposed approval set, rejected/downgraded items, source refs, and review
+  notes.
+- The assessment outcome: baseline, challenger, preferred, tied, rejected,
+  superseded, or needs-human.
+
+Acceptance criteria:
+- A documented schema exists for self-scoping assessment runs.
+- Assessment runs are append-only; reruns create new records instead of
+  rewriting old judgements.
+- Engine release binding is required before an assessment can be compared.
+- Dirty working trees are visible in the assessment metadata.
+
+## T02: Capture Current Bad Self-Run As A Regression Seed
+
+```task
+id: RREG-WP-0013-T02
+status: done
+priority: high
+state_hub_task_id: "98258aea-65bb-4709-921f-711c6cc6ee48"
+```
+
+Import or recreate the known-bad repo-scoping self-analysis as a named
+regression seed.
+
+Known bad pattern:
+
+- Candidate/approved capability: `Route LLM Requests Across Providers`.
+- Incorrect feature attachment: repo-scoping API/CLI surfaces nested under that
+  LLM-provider capability.
+- Incorrect evidence: scanner vocabulary, schema examples, tests, and
+  provider-name normalization code treated as repo-owned LLM routing behavior.
+
+Acceptance criteria:
+- The bad run can be inspected as a historical assessment artifact.
+- It is clearly marked as a negative baseline, not a desired golden output.
+- The failure explanation is stored next to the captured graph.
+- Future comparison reports can flag when a challenger repeats the same pattern.
+
+## T03: Create Desired Repo-Scoping Golden Profile
+
+```task
+id: RREG-WP-0013-T03
+status: done
+priority: high
+state_hub_task_id: "f3ef1711-a115-4368-a97e-98abd1eda521"
+```
+
+Author a curated golden profile for repo-scoping itself. This should be compact
+enough for comparison but expressive enough to catch hierarchy errors.
+
+Expected native capabilities should cover at least:
+
+- Repository registration and metadata import.
+- Deterministic repository scanning into observed facts.
+- Source-role and provenance-aware content indexing.
+- Candidate characteristic generation from facts and content.
+- Candidate review, edit, reject, merge, relink, and approval workflow.
+- Approved characteristic search, comparison, export, and capability-gap
+  exploration.
+- SCOPE.md generation, diffing, validation, and write/update flows.
+- Dependency graph and characteristic impact exploration.
+- Scope context API support for downstream agents such as activity-core.
+
+Forbidden top-level/native capabilities should include:
+
+- `Route LLM Requests Across Providers`, unless repo-scoping later genuinely
+  implements provider routing as a product feature rather than using
+  `llm-connect` as optional extraction infrastructure.
+
+Acceptance criteria:
+- The golden profile includes ability, capability, feature, and evidence
+  expectations with source paths.
+- The profile distinguishes native utility from dependencies, fixtures, test
+  vocabulary, schema examples, and optional LLM extraction infrastructure.
+- The profile is stored in a stable, reviewable fixture location.
+- The profile can evolve through explicit assessment decisions.
+
+Implementation note 2026-05-15: added
+`docs/schemas/self-scoping-assessment.schema.json`,
+`docs/self-scoping/assessments/repo-scoping-known-bad-2026-05-15-run-39.json`,
+`docs/self-scoping/golden/repo-scoping-golden-profile.v1.json`, and
+`tests/test_self_scoping_artifacts.py`. The known-bad artifact is marked as a
+negative regression seed with `historical_incomplete` release binding because
+the original analysis run did not record the engine commit.
+
+## T04: Export Assessment Artifacts From Analysis Runs
+
+```task
+id: RREG-WP-0013-T04
+status: todo
+priority: high
+state_hub_task_id: "51e01d45-7574-4c97-994d-dabb2bcf9a00"
+```
+
+Add a CLI and/or API workflow that exports a completed analysis run as a
+self-scoping assessment artifact.
+
+Acceptance criteria:
+- Export includes repository metadata, analysis run metadata, engine identity,
+  candidate graph, observed fact summary, content chunk summary, approved map
+  if present, review decisions, and quality-gate outcomes when available.
+- Export format is deterministic JSON with a documented schema.
+- Export refuses to mark an artifact comparable when engine identity is
+  incomplete.
+- Export can target repo-scoping itself without requiring network access.
+
+## T05: Compare Baseline And Challenger Runs
+
+```task
+id: RREG-WP-0013-T05
+status: todo
+priority: high
+state_hub_task_id: "2b71069b-6150-45f4-84a2-59f5ec1e04c0"
+```
+
+Implement comparison between an existing baseline and a later challenger run.
+
+Comparison should report:
+
+- Added, removed, renamed, and moved abilities/capabilities/features.
+- Hierarchy quality changes, especially misplaced features under the wrong
+  capability.
+- Native-utility precision: whether generated capabilities are repo-owned,
+  facade/adapter, dependency, tooling, fixture, schema-example, or mention-only.
+- Coverage against the repo-scoping golden profile.
+- Regression flags for known-bad patterns.
+- Source-ref quality: whether claims cite product intent, docs, source, tests,
+  fixtures, examples, or generated/derived scope.
+
+Acceptance criteria:
+- Comparison output is useful in both machine-readable JSON and human-readable
+  Markdown.
+- The report makes it easy to choose "old better", "new better", "tie", or
+  "needs review".
+- It does not require candidates to have stable database IDs across runs.
+- It can compare deterministic-only and agent-reviewed runs without losing
+  provenance.
+
+## T06: Add Side-By-Side Review UI
+
+```task
+id: RREG-WP-0013-T06
+status: todo
+priority: medium
+state_hub_task_id: "16a60b7c-7e2c-4bb0-b4ab-2381289dba0b"
+```
+
+Expose baseline/challenger comparison in the curator UI.
+
+Acceptance criteria:
+- Reviewers can select two assessment artifacts for repo-scoping.
+- The UI shows the two hierarchy trees side by side with moved/misplaced items
+  highlighted.
+- Reviewers can record preference, tie, rejection, and notes.
+- Review decisions are persisted as assessment outcomes, not as changes to the
+  underlying historical artifacts.
+
+## T07: Add Self-Scoping Regression Command
+
+```task
+id: RREG-WP-0013-T07
+status: todo
+priority: medium
+state_hub_task_id: "af1fcecd-686d-4592-b739-4698abc98c55"
+```
+
+Add a repeatable command for running repo-scoping against itself and comparing
+the result to the active baseline.
+
+Acceptance criteria:
+- The command captures engine identity before running analysis.
+- The command can run deterministic-only without LLM or agentic review.
+- The command can optionally invoke agentic review when configured.
+- The command emits a comparison report and exits non-zero only for explicit
+  CI-blocking regressions, not for ordinary "needs review" assessment outcomes.
+
+## T08: Document Assessment Workflow
+
+```task
+id: RREG-WP-0013-T08
+status: todo
+priority: medium
+state_hub_task_id: "30d71946-3598-4dc7-9970-c7c18126cad7"
+```
+
+Document how maintainers should use self-scoping assessment artifacts while
+evolving the engine.
+
+Acceptance criteria:
+- Documentation explains baseline, challenger, preferred, tied, rejected,
+  superseded, and needs-human outcomes.
+- Documentation explains engine release binding and why unbound output is not
+  comparable.
+- Documentation gives examples for the known-bad LLM-provider regression and a
+  desired native repo-scoping profile.
+- Documentation describes when to update the golden profile versus when to fix
+  the engine.
+
+## Completion Criteria
+
+- repo-scoping has an immutable, release-bound self-scoping assessment format.
+- The current known-bad output is captured as a negative regression seed.
+- A curated desired repo-scoping profile exists.
+- Maintainers can rerun repo-scoping on itself, compare old/new results, and
+  record which output is better.
+- Comparison results are bound to the repo-scoping release that generated them.
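Review note on RREG-WP-0013: the release-binding rule from T01, the known-bad seed from T02, and the name-based comparison from T05 can be illustrated with a minimal sketch. Every class, field, and function name below is hypothetical; the actual schema is `docs/schemas/self-scoping-assessment.schema.json`, which this diff does not show, and the version and commit strings in the usage note are placeholders.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EngineIdentity:
    """Hypothetical subset of the engine identity described in T01."""
    package_version: Optional[str]
    git_commit: Optional[str]
    dirty: bool = False  # dirty working trees stay visible in the metadata

    def is_complete(self) -> bool:
        return bool(self.package_version) and bool(self.git_commit)


@dataclass(frozen=True)
class Capability:
    name: str
    features: tuple[str, ...] = ()


@dataclass(frozen=True)  # frozen: reruns create new records, never mutate old ones
class Assessment:
    run_id: str
    engine: EngineIdentity
    capabilities: tuple[Capability, ...]
    outcome: str = "needs-human"

    def comparable(self) -> bool:
        # T01: engine release binding is required before comparison.
        return self.engine.is_complete()


# Forbidden top-level capability named in T03.
FORBIDDEN_CAPABILITIES = frozenset({"Route LLM Requests Across Providers"})


def regression_flags(assessment: Assessment) -> list[str]:
    """Flag known-bad capability names; works even without engine binding."""
    names = {c.name for c in assessment.capabilities}
    return sorted(names & FORBIDDEN_CAPABILITIES)


def compare(baseline: Assessment, challenger: Assessment) -> dict:
    """Compare by capability name, so no stable database IDs are needed."""
    for run in (baseline, challenger):
        if not run.comparable():
            raise ValueError(f"{run.run_id}: incomplete engine identity")
    old = {c.name for c in baseline.capabilities}
    new = {c.name for c in challenger.capabilities}
    return {
        "added": sorted(new - old),
        "removed": sorted(old - new),
        "regression_flags": regression_flags(challenger),
    }
```

In this sketch, a seed with an unbound engine (such as the `historical_incomplete` known-bad run) can still feed `regression_flags`, but `compare` refuses it:

```python
bad = Assessment(
    "run-39",
    EngineIdentity(None, None),
    (Capability("Route LLM Requests Across Providers", ("API", "CLI")),),
)
assert not bad.comparable()
assert regression_flags(bad) == ["Route LLM Requests Across Providers"]
```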
--- a/workplans/RREG-WP-0014-agentic-characteristic-acceptance.md
+++ b/workplans/RREG-WP-0014-agentic-characteristic-acceptance.md
@@ -0,0 +1,225 @@
+---
+id: RREG-WP-0014
+type: workplan
+title: "Agentic Characteristic Acceptance"
+domain: capabilities
+repo: repo-scoping
+status: active
+owner: codex
+topic_slug: foerster-capabilities
+created: "2026-05-15"
+updated: "2026-05-15"
+state_hub_workstream_id: "7feaa5b5-32d8-4b8e-b377-cbb3ddacf64a"
+---
+
+# Agentic Characteristic Acceptance
+
+Deterministic rules should not automatically accept candidate
+characteristics. Determinism is strongest at fast, source-linked observation
+(facts, provenance) and at applying transparent rejection or downgrade
+criteria: formal quality checks, schema validation, duplicate detection, and
+clear negative filters.
+
+Acceptance is a judgement step. When automation stands in for human judgement,
+it should be agentic: inspect the evidence, apply the visible quality criteria,
+explain the decision, and leave a reviewable trace. Deterministic rules may
+invalidate, downgrade, or require review, but they should not silently promote a
+candidate into approved registry truth.
+
+## T01: Define Acceptance Policy Boundary
+
+```task
+id: RREG-WP-0014-T01
+status: todo
+priority: high
+state_hub_task_id: "4bc2e749-ec9e-45d4-8095-63181efb752b"
+```
+
+Write the policy boundary between deterministic gates and acceptance
+judgement.
+
+Policy principles:
+
+- Deterministic scanners generate observed facts and source refs.
+- Deterministic quality gates can reject, downgrade, merge, flag, or require
+  review when criteria are formally expressible.
+- Deterministic quality gates cannot approve candidate characteristics.
+- Human reviewers can approve.
+- Trusted agentic reviewers can approve only after producing an evidence-based
+  rationale.
+- All automated review outcomes must be inspectable and reversible.
+
+Acceptance criteria:
+- Documentation states that deterministic auto-approval is prohibited.
+- Existing "trusted auto-approve" terminology is marked for replacement or
+  migration.
+- The allowed deterministic outcomes are explicitly listed.
+- The allowed agentic outcomes are explicitly listed.
+
+## T02: Create Transparent Quality Criteria Registry
+
+```task
+id: RREG-WP-0014-T02
+status: todo
+priority: high
+state_hub_task_id: "101998a4-8cf8-4df0-8d05-c4e2041c0cac"
+```
+
+Create a reviewable quality criteria registry for candidate characteristics.
+
+Initial criteria should cover:
+
+- Source-role quality: intent/docs/source/tests are stronger than fixtures,
+  schema examples, agent guidance, CI/tooling, dependency declarations, or
+  derived scope.
+- Native utility: owned/facade/adapter claims require explicit product evidence;
+  dependency, tooling, configuration, fixture, schema-example, and mention-only
+  claims are not native capabilities.
+- Hierarchy fit: features should support their parent capability; misplaced
+  API/CLI surfaces should be flagged.
+- Evidence sufficiency: candidate claims need source refs that support the
+  actual abstraction, not just matching vocabulary.
+- Circularity: generated `SCOPE.md` text cannot be primary proof for rebuilding
+  the same characteristic model.
+- Fixture contamination: tests and expectation files can prove scanner behavior
+  but should not become repo-native product capability claims.
+
+Acceptance criteria:
+- Criteria are stored in a versioned, human-readable format.
+- Each criterion has an identifier, description, severity, deterministic action
+  if applicable, and reviewer guidance.
+- Criteria can be listed through CLI and/or API.
+- Assessment and review records include the criteria version used.
+
+## T03: Implement Deterministic Quality Gate Outcomes
+
+```task
+id: RREG-WP-0014-T03
+status: todo
+priority: high
+state_hub_task_id: "d599c084-a207-4910-9d0b-578d0c50f282"
+```
+
+Apply quality criteria before any human or agentic acceptance step.
+
+Acceptance criteria:
+- Candidate abilities, capabilities, features, and evidence can carry gate
+  outcomes such as `pass`, `downgraded`, `rejected`, `requires_review`, and
+  `invalidated`.
+- Rejected or invalidated candidates remain auditable with reason codes.
+- Downgraded candidates remain visible but cannot be accepted without explicit
+  reviewer override.
+- Deterministic gates never mark a candidate as approved.
+- The known repo-scoping LLM-provider self-scan failure is flagged before
+  acceptance.
+
+## T04: Replace Trusted Auto-Approval With Agentic Review
+
+```task
+id: RREG-WP-0014-T04
+status: todo
+priority: high
+state_hub_task_id: "b0d29756-7460-4ffa-8d56-d94cfb34e94f"
+```
+
+Replace `trusted_auto_approve_candidate_graph` behavior with an agentic review
+workflow.
+
+Acceptance criteria:
+- Existing API/CLI/UI affordances no longer present deterministic
+  auto-approval as a safe path.
+- A configured agentic reviewer receives the candidate graph, source refs,
+  quality-gate outcomes, criteria version, and repository context.
+- The reviewer can approve, reject, downgrade, request human review, relink,
+  or propose edits.
+- Each agentic approval includes a rationale tied to evidence and criteria.
+- If no agentic reviewer is configured, candidates remain pending review.
+
+## T05: Add Review Decision Audit Trail
+
+```task
+id: RREG-WP-0014-T05
+status: todo
+priority: high
+state_hub_task_id: "0d12559a-831e-40ff-bf82-85f45b763f07"
+```
+
+Extend review decisions so acceptance history is useful for later audits and
+self-scoping assessments.
+
+Acceptance criteria:
+- Review decisions record reviewer type: human, agent, deterministic-gate, or
+  migration.
+- Agentic decisions record reviewer identity/configuration, criteria version,
+  prompt or policy version, evidence inspected, and rationale.
+- Deterministic gate decisions record rule IDs and outcomes, not approval.
+- Review records distinguish "candidate accepted as-is" from "accepted after
+  edits/relinks".
+- Existing decisions remain readable through a migration or compatibility view.
+
+## T06: Add Human Override And Criteria Refinement Flow
+
+```task
+id: RREG-WP-0014-T06
+status: todo
+priority: medium
+state_hub_task_id: "bcba3237-fb87-4a38-8e96-12b872d5e6a9"
+```
+
+Make quality criteria reviewable and refinable instead of hidden in code.
+
+Acceptance criteria:
+- Reviewers can inspect which criteria fired for a candidate.
+- Reviewers can override a gate with a reason.
+- Overrides are searchable so repeated overrides can drive criteria changes.
+- Criteria changes are versioned and linked to workplans or decisions.
+- The UI makes it clear when a candidate is blocked by formal criteria versus
+  merely awaiting judgement.
+
+## T07: Regression Coverage For Acceptance Boundary
+
+```task
+id: RREG-WP-0014-T07
+status: todo
+priority: high
+state_hub_task_id: "37a22c89-ded5-42dd-aaa9-ece79477fcff"
+```
+
+Add tests that lock in the new acceptance boundary.
+
+Acceptance criteria:
+- Deterministic analysis can generate facts and candidates but cannot approve
+  them.
+- Deterministic gates can reject/downgrade/require review with reason codes.
+- Agentic review can approve only with a rationale and criteria version.
+- The repo-scoping self-scan LLM-provider failure is not accepted by
+  deterministic rules.
+- Existing manual review and approval paths keep working.
+
+## T08: Migration And Compatibility Plan
+
+```task
+id: RREG-WP-0014-T08
+status: todo
+priority: medium
+state_hub_task_id: "3d5475f6-71a7-4ca7-aa69-573e91d1fe1e"
+```
+
+Plan the migration away from trusted deterministic auto-approval.
+
+Acceptance criteria:
+- Existing approved maps created by trusted auto-approval can be identified.
+- Users can rebuild or re-review those maps without losing audit history.
+- API and CLI changes are documented with compatibility notes.
+- The old behavior is either removed or guarded behind an explicit deprecated
+  migration mode that cannot run by default.
+
+## Completion Criteria
+
+- Deterministic rules no longer approve candidate characteristics.
+- Transparent, versioned quality criteria can reject, downgrade, invalidate, or
+  require review.
+- Agentic review is the only automated path that can stand in for human
+  acceptance.
+- Acceptance decisions are auditable, evidence-bound, and useful as training
+  signal for future self-scoping assessment.
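Review note on RREG-WP-0014: the acceptance boundary from T01, T03, and T07 can be sketched as a single guard function. The gate outcome values and reviewer types are taken from the workplan text; the class names, the `approve` function, and its parameters are hypothetical. Two points are assumptions because the workplan does not settle them: the `migration` reviewer type is treated as unable to issue new approvals, and only `downgraded` candidates require an override reason (the handling of `rejected` and `invalidated` candidates is left unspecified).

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Optional


class GateOutcome(str, Enum):
    # Outcome names listed in T03.
    PASS = "pass"
    DOWNGRADED = "downgraded"
    REJECTED = "rejected"
    REQUIRES_REVIEW = "requires_review"
    INVALIDATED = "invalidated"


class ReviewerType(str, Enum):
    # Reviewer types listed in T05.
    HUMAN = "human"
    AGENT = "agent"
    DETERMINISTIC_GATE = "deterministic-gate"
    MIGRATION = "migration"


@dataclass(frozen=True)
class ReviewDecision:
    candidate: str
    reviewer_type: ReviewerType
    decision: str
    rationale: Optional[str] = None
    criteria_version: Optional[str] = None
    override_reason: Optional[str] = None


def approve(
    candidate: str,
    gate_outcome: GateOutcome,
    reviewer_type: ReviewerType,
    *,
    rationale: Optional[str] = None,
    criteria_version: Optional[str] = None,
    override_reason: Optional[str] = None,
) -> ReviewDecision:
    """Return an approval record, or raise if the policy boundary forbids it."""
    # Only humans and agents may approve; deterministic gates never do.
    # Excluding "migration" here is an assumption of this sketch.
    if reviewer_type not in (ReviewerType.HUMAN, ReviewerType.AGENT):
        raise PermissionError(
            f"{reviewer_type.value} reviewers cannot approve candidates"
        )
    # Agentic approval needs an evidence-based rationale and criteria version.
    if reviewer_type is ReviewerType.AGENT and not (rationale and criteria_version):
        raise ValueError("agentic approval requires rationale and criteria version")
    # Downgraded candidates need an explicit reviewer override.
    if gate_outcome is GateOutcome.DOWNGRADED and not override_reason:
        raise ValueError("downgraded candidate requires an override reason")
    return ReviewDecision(
        candidate, reviewer_type, "approved",
        rationale, criteria_version, override_reason,
    )
```

Raising instead of returning a silent "not approved" keeps the boundary visible in tests: a deterministic caller that tries to approve fails loudly, which is the behavior T07 asks regression coverage to lock in.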