repo-scoping/docs/abstraction-strategy.md

# Abstraction Strategy

The registry has three layers with different trust levels:

1. Observed facts are deterministic scanner output: files, manifests, framework
   hints, tests, docs, routes, commands, and source locations.
2. Candidate claims are abstractions proposed from those facts. They are useful
   review seeds, not registry truth.
3. Approved entries are curated truth, accepted after human review or through an
   explicit trusted automation mode.
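
As a rough illustration of how the candidate and approved layers might relate in code, the sketch below models an entry's layer and the promotion rule. The names `Layer`, `Entry`, and `approve`, and the example source ref, are hypothetical and not taken from the registry implementation.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Optional, Tuple


class Layer(Enum):
    OBSERVED = "observed"    # deterministic scanner output
    CANDIDATE = "candidate"  # proposed abstraction, a review seed
    APPROVED = "approved"    # curated registry truth


@dataclass(frozen=True)
class Entry:
    # Hypothetical shape: a named abstraction plus the facts that back it.
    name: str
    source_refs: Tuple[str, ...]
    layer: Layer = Layer.CANDIDATE


def approve(entry: Entry, reviewer: Optional[str] = None,
            trusted_automation: bool = False) -> Entry:
    """Promote a candidate only after review or explicit trusted automation."""
    if entry.layer is not Layer.CANDIDATE:
        raise ValueError("only candidate entries can be approved")
    if reviewer is None and not trusted_automation:
        raise PermissionError("approval needs a reviewer or trusted automation")
    # Entries are frozen, so promotion returns a new object.
    return replace(entry, layer=Layer.APPROVED)
```

Keeping entries immutable means a candidate cannot silently drift into the approved layer; promotion is always an explicit call.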

## Granularity

Features should describe a user-visible or operational behavior surface, not mirror
individual scanner facts. A one-to-one pattern such as one route fact becoming one
feature is a smell unless the repository truly exposes only one behavior.

Current deterministic grouping:

- Multiple HTTP route facts become one `HTTP API surface` feature with several
  source references.
- Multiple CLI command facts become one `CLI command surface` feature with several
  source references.
- Facts remain available as drilldown evidence through `source_refs`.

This gives reviewers orientation at the behavior level while keeping provenance.
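
The grouping described above could be sketched as follows. This is a minimal illustration, assuming facts arrive as `(kind, source_ref)` pairs; the kind strings, the `Feature` class, and `group_into_surfaces` are hypothetical names, and only the two feature names come from this document.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple

# Hypothetical fact kinds mapped to the surface feature names used above.
SURFACE_NAMES = {
    "http_route": "HTTP API surface",
    "cli_command": "CLI command surface",
}


@dataclass
class Feature:
    name: str
    source_refs: List[str] = field(default_factory=list)


def group_into_surfaces(facts: Iterable[Tuple[str, str]]) -> List[Feature]:
    """Collapse interface facts into one feature per surface kind."""
    grouped = defaultdict(list)
    for kind, source_ref in facts:
        if kind in SURFACE_NAMES:  # non-interface facts are left ungrouped
            grouped[kind].append(source_ref)
    # Sort kinds and refs so the output is stable across runs.
    return [
        Feature(SURFACE_NAMES[kind], sorted(refs))
        for kind, refs in sorted(grouped.items())
    ]
```

Each fact stays reachable through `source_refs`, so a reviewer can still drill down from the surface feature to the individual routes or commands.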

## What Deterministic Logic Can Do

Deterministic scanners can reliably identify:

- repository structure and languages
- package manifests and framework hints
- API/CLI entry-point surfaces
- docs, examples, and tests as corroborating evidence
- stable source references for review and approval

Deterministic candidate generation can group these into conservative capabilities
such as interface exposure and repository structure. It should not claim domain
intent when the evidence is thin.

## Where LLM Assistance Helps

LLMs are most useful for naming and explaining intent:

- turning `HTTP API surface` into a domain capability such as repository ingestion,
  review workflow, or search
- separating administrative, operational, and product-facing capabilities
- summarizing README and code context into clearer ability descriptions
- suggesting merges or relinks when deterministic names are too generic

LLM output remains candidate material. It should cite source paths and be reviewed
or explicitly auto-approved by a trusted mode before becoming approved registry
truth.
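
One way to enforce the citation requirement is a check that runs before an LLM suggestion is even admitted as a candidate. The sketch below is an assumption about how such a check might look; `Suggestion` and `admit_as_candidate` are hypothetical names, and approval itself remains a separate step.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Suggestion:
    # Hypothetical shape of one LLM-proposed capability.
    name: str
    source_paths: Tuple[str, ...]


def admit_as_candidate(suggestion: Suggestion,
                       observed_paths: Iterable[str]) -> bool:
    """Accept a suggestion only if it cites paths the scanner actually saw."""
    cited = set(suggestion.source_paths)
    # Reject uncited suggestions and any that cite unknown paths.
    return bool(cited) and cited <= set(observed_paths)
```

A suggestion that passes is still only candidate material; this check guards provenance, not correctness of the proposed name.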

## Trial Repo Observations

`repo-registry` demonstrates the current boundary well: deterministic scanning sees
FastAPI routes, tests, docs, and Python structure, but the meaningful abstractions
are repository ingestion, deterministic analysis, candidate review, discovery, and
State Hub coordination. Those names likely require either review edits or LLM
assistance.

The other trial repos reinforce the same point: fact lists are useful audit trails,
but the primary UI should lead with candidate or approved ability maps and expose
facts as drilldown evidence.

## Regression Guard

`tests/test_candidate_graph.py` includes a guard that checks that multiple interface
facts are grouped into behavioral surface features, each with multiple source refs.
This protects against regressing to one feature per observed fact.
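
The property such a guard checks might be expressed as below. This is not the contents of `tests/test_candidate_graph.py`; the helper name and the dict-based feature shape are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def assert_grouped_by_surface(facts: List[Tuple[str, str]],
                              features: List[Dict]) -> None:
    """Fail if any fact kind maps to anything other than one grouped feature.

    facts: (kind, source_ref) pairs.
    features: dicts with hypothetical keys "kind" and "source_refs".
    """
    fact_counts = Counter(kind for kind, _ in facts)
    for kind, count in fact_counts.items():
        matching = [f for f in features if f["kind"] == kind]
        assert len(matching) == 1, (
            f"{kind}: expected one surface feature, got {len(matching)}"
        )
        assert len(matching[0]["source_refs"]) == count, (
            f"{kind}: feature should carry all {count} source refs"
        )
```

A one-feature-per-fact regression trips the first assertion as soon as a kind has two or more facts.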