Abstraction Strategy

The registry has three layers with different trust levels:

  1. Observed facts are deterministic scanner output: files, manifests, framework hints, tests, docs, routes, commands, and source locations.
  2. Candidate claims are abstractions proposed from those facts. They are useful review seeds, not registry truth.
  3. Approved entries are curated truth. A claim becomes an approved entry only after human review or through an explicit trusted automation mode.
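As a rough illustration, the three layers could be modeled as below. This is a minimal sketch, not the registry's actual schema: the names `Layer`, `ObservedFact`, and `RegistryItem`, and the example path `app/routes.py`, are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Layer(Enum):
    """The three registry layers (hypothetical names)."""
    OBSERVED_FACT = "observed_fact"      # deterministic scanner output
    CANDIDATE_CLAIM = "candidate_claim"  # proposed abstraction, a review seed
    APPROVED_ENTRY = "approved_entry"    # curated truth


@dataclass(frozen=True)
class ObservedFact:
    kind: str   # e.g. "http_route", "cli_command", "manifest"
    path: str   # source location of the fact
    line: int


@dataclass
class RegistryItem:
    name: str
    source_refs: list[ObservedFact] = field(default_factory=list)
    # New abstractions start as candidates, never as approved truth.
    layer: Layer = Layer.CANDIDATE_CLAIM


# Illustrative data only; the path is made up.
fact = ObservedFact(kind="http_route", path="app/routes.py", line=12)
claim = RegistryItem(name="HTTP API surface", source_refs=[fact])
```

Defaulting `layer` to the candidate level means a claim can only become approved through a deliberate step.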

Granularity

Features should describe a user-visible or operational behavior surface, not mirror individual scanner facts. A one-to-one pattern such as one route fact becoming one feature is a smell unless the repository truly exposes only one behavior.

Current deterministic grouping:

  • Multiple HTTP route facts become one HTTP API surface feature with several source references.
  • Multiple CLI command facts become one CLI command surface feature with several source references.
  • Facts remain available as drilldown evidence through source_refs.

This gives reviewers orientation at the behavior level while keeping provenance.
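The grouping described above can be sketched as follows. This is an assumption-laden illustration rather than the project's implementation: the fact kinds `http_route` and `cli_command`, the `Fact` type, and the function `group_into_features` are hypothetical; only the feature names and the `source_refs` field come from this document.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    kind: str        # hypothetical kind label, e.g. "http_route"
    detail: str      # e.g. "GET /repos" or a command name
    source_ref: str  # e.g. "path/to/file.py:42"


# Map each interface fact kind to one behavior-level surface feature.
SURFACE_NAMES = {
    "http_route": "HTTP API surface",
    "cli_command": "CLI command surface",
}


def group_into_features(facts):
    """Collapse interface facts into one feature per surface kind."""
    grouped = defaultdict(list)
    for fact in facts:
        if fact.kind in SURFACE_NAMES:
            grouped[fact.kind].append(fact.source_ref)
    # Each feature keeps every contributing source reference as evidence.
    return [
        {"name": SURFACE_NAMES[kind], "source_refs": refs}
        for kind, refs in grouped.items()
    ]


# Illustrative facts; paths and routes are made up.
facts = [
    Fact("http_route", "GET /repos", "api/routes.py:10"),
    Fact("http_route", "POST /repos", "api/routes.py:25"),
    Fact("cli_command", "scan", "cli/main.py:8"),
    Fact("cli_command", "review", "cli/main.py:30"),
]
features = group_into_features(facts)
```

Four facts yield two features here, and each feature still lists the two source references it was built from.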

What Deterministic Logic Can Do

Deterministic scanners can reliably identify:

  • repository structure and languages
  • package manifests and framework hints
  • API/CLI entry-point surfaces
  • docs, examples, and tests as corroborating evidence
  • stable source references for review and approval

Deterministic candidate generation can group these into conservative capabilities such as interface exposure and repository structure. It should avoid pretending it understands domain intent when the evidence is thin.

Where LLM Assistance Helps

LLMs are most useful for naming and explaining intent:

  • turning a generic HTTP API surface feature into a domain capability such as repository ingestion, review workflow, or search
  • separating administrative, operational, and product-facing capabilities
  • summarizing README and code context into clearer ability descriptions
  • suggesting merges or relinks when deterministic names are too generic

LLM output remains candidate material. It should cite source paths and be reviewed or explicitly auto-approved by a trusted mode before becoming approved registry truth.
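One way to express that gate is sketched below. The names `LlmProposal` and `may_approve` are hypothetical, and checking cited paths against the set of scanner-observed paths is an added assumption; the text above only requires that source paths be cited.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class LlmProposal:
    name: str
    cited_paths: tuple[str, ...]  # source paths the LLM cites as evidence


def may_approve(
    proposal: LlmProposal,
    observed_paths: set[str],
    *,
    reviewed_by: str | None = None,
    trusted_mode: bool = False,
) -> bool:
    """Return True only if the proposal is grounded and explicitly signed off."""
    # A proposal without citations cannot be traced back to evidence.
    if not proposal.cited_paths:
        return False
    # Assumption: every cited path must be one the scanner actually observed.
    if not set(proposal.cited_paths) <= observed_paths:
        return False
    # Approval needs a human reviewer or an explicit trusted automation mode.
    return reviewed_by is not None or trusted_mode


# Illustrative data only; the paths are made up.
observed = {"api/routes.py", "README.md"}
proposal = LlmProposal("repository ingestion", ("api/routes.py",))
```

With these inputs, `may_approve(proposal, observed)` is false until either `reviewed_by` or `trusted_mode` is supplied.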

Trial Repo Observations

repo-scoping demonstrates the current boundary well: deterministic scanning sees FastAPI routes, tests, docs, and Python structure, but the meaningful abstractions are repository ingestion, deterministic analysis, candidate review, discovery, and State Hub coordination. Those names likely require either review edits or LLM assistance.

The other trial repos reinforce the same point: fact lists are useful audit trails, but the primary UI should lead with candidate or approved ability maps and expose facts as drilldown evidence.

Regression Guard

tests/test_candidate_graph.py includes a guard asserting that multiple interface facts are grouped into behavioral surface features, each carrying multiple source refs. This protects against regressing to one feature per observed fact.
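The property such a guard checks might look like the sketch below. It is not the contents of tests/test_candidate_graph.py: the helper `check_grouping`, the dictionary shapes, and the fact kinds are hypothetical, and it assumes a fixture with several facts per interface kind.

```python
INTERFACE_KINDS = {"http_route", "cli_command"}  # hypothetical kind labels


def check_grouping(facts, features):
    """Raise AssertionError if features merely mirror facts one-to-one."""
    interface_facts = [f for f in facts if f["kind"] in INTERFACE_KINDS]
    # Grouping must produce fewer features than interface facts.
    assert len(features) < len(interface_facts), "one feature per fact"
    # Every surface feature must keep more than one source reference.
    for feature in features:
        assert len(feature["source_refs"]) > 1, feature["name"]


# Illustrative fixture; the paths are made up.
facts = [
    {"kind": "http_route", "source_ref": "api/routes.py:10"},
    {"kind": "http_route", "source_ref": "api/routes.py:25"},
]
grouped = [
    {
        "name": "HTTP API surface",
        "source_refs": ["api/routes.py:10", "api/routes.py:25"],
    }
]
check_grouping(facts, grouped)  # passes: two facts, one feature, two refs
```

A result with one feature per route fact would fail the first assertion.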