Add discovery snapshot contract
165
docs/repo-reality-scanner.md
Normal file
# Repo Reality Scanner

The repo reality scanner discovers Fabric entities from repository evidence and
turns them into candidate graph facts. It is a discovery layer, not a new
authoring surface. Repo-owned declarations remain the highest-trust source for
accepted Fabric graph data.

## Contract

A scanner run emits a `FabricDiscoverySnapshot`. The snapshot is scoped to one
repository, one commit, and one scan profile. It contains:

- replacement scopes, which define the evidence sets that may be replaced on a
  rescan
- candidate nodes, edges, and attributes
- source anchors for every candidate
- extractor provenance for every candidate
- tombstones for candidates that vanished inside a replacement scope
- reconciliation policy metadata

The JSON schema lives at `schemas/discovery-snapshot.schema.yaml`.
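
To make the shape of a snapshot concrete, here is a minimal sketch using dataclasses. Only the name `FabricDiscoverySnapshot` comes from this document; the `Candidate` class, every field name, and the `missing_evidence` helper are assumptions for illustration. The schema file remains the actual contract.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    # Hypothetical candidate record; field names are illustrative only.
    stable_key: str
    kind: str  # "node", "edge", or "attribute"
    scope_id: str  # replacement scope that owns this candidate
    source_anchors: list = field(default_factory=list)
    provenance: dict = field(default_factory=dict)


@dataclass
class FabricDiscoverySnapshot:
    # A snapshot is scoped to one repository, one commit, one scan profile.
    repo: str
    commit: str
    scan_profile: str
    replacement_scopes: list = field(default_factory=list)
    candidates: list = field(default_factory=list)
    tombstones: list = field(default_factory=list)
    reconciliation_policy: dict = field(default_factory=dict)

    def missing_evidence(self) -> list:
        """Return stable keys of candidates lacking an anchor or provenance."""
        return [
            c.stable_key
            for c in self.candidates
            if not c.source_anchors or not c.provenance
        ]
```

A check like `missing_evidence()` reflects the rule that every candidate carries both source anchors and extractor provenance; a producer could refuse to emit a snapshot while it returns anything.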

## Identity

Identity is the main safety boundary. The scanner must not append guesses on
every run. It needs to produce stable keys that are repeatable for the same
observed entity.

Candidate node keys use this shape:

```text
discovery:{repo_slug}:{entity_kind}:{normalized_name}[:source_fingerprint]
```

Use the optional source fingerprint when a name is too generic or when multiple
entities of the same kind can share a display name. Examples include HTTP
routes, generated clients, deployment manifests, and catalog records.
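
The key shape above could be produced along these lines. This is a sketch, not the actual helpers: `normalize_segment`, `source_fingerprint`, and `node_key` are hypothetical names, and both the normalization rules and the 12-character SHA-256 prefix are assumptions. The real rules are whatever `railiance_fabric.discovery` defines.

```python
import hashlib
import re
import unicodedata


def normalize_segment(value: str) -> str:
    """Lowercase, fold to ASCII, and collapse other characters to '-'."""
    folded = unicodedata.normalize("NFKD", value)
    ascii_only = folded.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_only.lower()).strip("-")


def source_fingerprint(source: str) -> str:
    """Short digest of a source location string (assumed format)."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()[:12]


def node_key(repo_slug, entity_kind, name, source=None) -> str:
    """Build discovery:{repo_slug}:{entity_kind}:{normalized_name}[:fp]."""
    parts = [
        "discovery",
        normalize_segment(repo_slug),
        normalize_segment(entity_kind),
        normalize_segment(name),
    ]
    if source is not None:
        # Disambiguate generic names such as HTTP routes by their source.
        parts.append(source_fingerprint(source))
    return ":".join(parts)
```

Because every step is a pure function of its inputs, two scans of the same entity yield the same key, while two routes that share a display name but come from different files get distinct keys.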

Candidate edge keys use a relationship fingerprint over:

- source stable key
- edge type
- target stable key
- optional evidence scope

Candidate attribute keys use the entity stable key plus the normalized
attribute name and, where needed, a source fingerprint.
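
A relationship fingerprint and an attribute key could be derived as follows. The function names, the separator byte, the `attr` segment, and the 16-character digest length are all assumptions made for this sketch.

```python
import hashlib


def relationship_fingerprint(source_key, edge_type, target_key, evidence_scope=None):
    """Hash the edge's identity fields in a fixed order."""
    fields = [source_key, edge_type, target_key, evidence_scope or ""]
    # The unit separator keeps ("ab", "c") distinct from ("a", "bc").
    payload = "\x1f".join(fields)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def attribute_key(entity_key, attribute_name, source_fp=None):
    """Entity stable key + normalized attribute name (+ optional fingerprint)."""
    name = attribute_name.strip().lower().replace(" ", "_")
    parts = [entity_key, "attr", name]
    if source_fp:
        parts.append(source_fp)
    return ":".join(parts)
```

Field order matters here: swapping source and target produces a different fingerprint, which is the desired behavior for directed edges.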

Stable-key parts are lowercased and normalized to ASCII-like identity segments.
The helper functions in `railiance_fabric.discovery` define the initial rules.

## Source Anchors

Every candidate must carry one or more source anchors. A source anchor identifies
why the scanner believes the fact exists. Anchors can point to files, package
manifests, lockfiles, API contracts, deployment manifests, service catalogs,
registries, LLM evidence bundles, or manual review notes.

Source anchors include a fingerprint. The fingerprint should cover stable
location fields such as path, URL, ref, line range, or JSON pointer. Snippets are
useful for review but should not be the only identity anchor because formatting
noise can churn snippets.
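
One way to keep snippets out of the fingerprint is to hash only the location fields. In this sketch the anchor is a plain dict and the field names in `LOCATION_FIELDS` are assumptions based on the list above, not the schema's actual property names.

```python
import hashlib
import json

# Assumed names for the stable location fields of an anchor.
LOCATION_FIELDS = ("path", "url", "ref", "line_range", "json_pointer")


def anchor_fingerprint(anchor: dict) -> str:
    """Hash only stable location fields; snippets are deliberately ignored."""
    location = {
        name: anchor[name]
        for name in LOCATION_FIELDS
        if anchor.get(name) is not None
    }
    # Canonical JSON (sorted keys, no whitespace) keeps the digest stable.
    canonical = json.dumps(location, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

With this approach, reformatting the quoted snippet leaves the fingerprint unchanged, while moving the evidence to a different path or line range changes it.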

## Replacement Scopes

A replacement scope says which extractor owns which set of candidates. Rescans
may retire missing candidates only inside the same scope.

Examples:

- `scope:repo-scoping:python-package:package_manifest:<hash>`
- `scope:state-hub:fabric-declarations:declaration`
- `scope:llm-connect:readme-summary:file:<hash>`
- `scope:railiance-fabric:local-registry:fabric_registry`

Scopes have a mode:

- `replacement`: candidates missing from the next run in the same scope become
  tombstones.
- `additive`: candidates are added or updated, but absence does not retire old
  candidates.
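
The difference between the two modes comes down to what absence means. A minimal sketch, assuming each scope's candidates are tracked as a set of stable keys (`retire_missing` is a hypothetical helper):

```python
def retire_missing(previous_keys: set, current_keys: set, mode: str) -> set:
    """Return the stable keys to tombstone for a single scope."""
    if mode == "replacement":
        # Absence in the new run means the candidate vanished.
        return previous_keys - current_keys
    if mode == "additive":
        # Absence carries no meaning; nothing is retired.
        return set()
    raise ValueError(f"unknown scope mode: {mode}")
```

The function is called once per scope, so a candidate owned by another scope can never be retired by this run.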

LLM extractors should usually use replacement mode only for tightly bounded
evidence bundles. Broad repo summaries are safer as additive or review-only
until the extraction prompts are proven stable.

## Merge Precedence

When multiple sources describe the same entity, reconciliation uses this
precedence:

1. `repo_declaration`
2. `deterministic`
3. `catalog`
4. `registry`
5. `llm`
6. `manual`

Manual review can override local candidate state, but it should not silently
rewrite repo-owned declarations. If accepted discoveries should become
authoritative, the safer next step is to generate a repo-owned declaration patch
for human review.
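
Applied mechanically, the precedence list reduces to a rank lookup. The sketch below assumes each candidate is a dict with a `source_type` field holding one of the six names; that field name and the `pick_winner` helper are hypothetical, and the sketch does not model manual overrides of local candidate state.

```python
PRECEDENCE = [
    "repo_declaration",
    "deterministic",
    "catalog",
    "registry",
    "llm",
    "manual",
]
RANK = {name: position for position, name in enumerate(PRECEDENCE)}


def pick_winner(candidates: list) -> tuple:
    """Split candidates for one entity into (winner, supporting evidence)."""
    # min() keeps the first candidate when two share the same rank.
    winner = min(candidates, key=lambda c: RANK[c["source_type"]])
    supporting = [c for c in candidates if c is not winner]
    return winner, supporting
```

Lower-precedence candidates are kept as supporting evidence rather than discarded, which preserves their source anchors for later review.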

## Duplicate Handling

The reconciler should merge candidates with the same stable key automatically.
It should also look for possible duplicates using:

- alias overlap
- identical source anchors
- identical evidence fingerprints
- normalized label similarity within the same entity kind
- relationship fingerprints with the same endpoints and edge type
- declaration ids that match discovery aliases

Exact stable-key matches can be merged automatically. Alias-only or
similarity-only matches should become `needs_review` conflicts unless an
extractor has a source-specific rule that makes the match deterministic.
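
The merge-or-review decision could be sketched as below. It covers only two of the listed signals (alias overlap and label similarity); the dict field names, the `deterministic_rule` flag, and the 0.9 similarity threshold are assumptions chosen for illustration.

```python
import difflib
import re

SIMILARITY_THRESHOLD = 0.9  # assumed cut-off, not taken from the contract


def normalize_label(label: str) -> str:
    """Strip case and punctuation so 'Billing API' matches 'billing-api'."""
    return re.sub(r"[^a-z0-9]+", "", label.lower())


def classify_match(a: dict, b: dict, deterministic_rule: bool = False) -> str:
    """Return 'merge', 'needs_review', or 'distinct' for two candidates."""
    if a["stable_key"] == b["stable_key"]:
        return "merge"  # exact stable-key matches merge automatically
    alias_overlap = bool(set(a.get("aliases", ())) & set(b.get("aliases", ())))
    similar_label = False
    if a["entity_kind"] == b["entity_kind"]:
        ratio = difflib.SequenceMatcher(
            None, normalize_label(a["label"]), normalize_label(b["label"])
        ).ratio()
        similar_label = ratio >= SIMILARITY_THRESHOLD
    if alias_overlap or similar_label:
        # Weak signals only merge when a source-specific rule vouches for them.
        return "merge" if deterministic_rule else "needs_review"
    return "distinct"
```

Keeping the weak signals on a separate path from the exact-key check makes it hard for a fuzzy match to slip into an automatic merge by accident.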

## Rescan And Tombstones

On a rescan, the scanner compares the previous accepted discovery snapshot with
the newly produced snapshot for the same repo/profile.

- Same stable key: update in place.
- Same source anchor but changed attributes: update with changed evidence.
- Missing from same replacement scope: create a tombstone.
- Missing from a different scope: leave untouched.
- Reappears after tombstone: reactivate if the stable key and scope match.
- Reappears with a new key but same alias/source anchor: flag as possible
  duplicate resurrection.
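
The comparison rules above can be sketched as a single classification pass. This is a simplified illustration: each snapshot is modeled as a dict from stable key to `{"scope": ..., "anchor": ...}`, all names are hypothetical, attribute diffing is omitted, and duplicate resurrection is detected by anchor only, not by alias.

```python
def classify_rescan(previous, current, tombstoned, scanned_scopes):
    """Map each stable key to the action a rescan should take."""
    actions = {}
    for key, cand in current.items():
        old = tombstoned.get(key)
        if key in previous:
            actions[key] = "update"
        elif old is not None and old["scope"] == cand["scope"]:
            actions[key] = "reactivate"
        elif any(
            t["anchor"] == cand["anchor"]
            for k, t in tombstoned.items()
            if k != key
        ):
            # New key, but evidence matches something already retired.
            actions[key] = "possible_duplicate_resurrection"
        else:
            actions[key] = "create"
    for key, cand in previous.items():
        if key not in current:
            # Only scopes covered by this run may retire their candidates.
            in_scope = cand["scope"] in scanned_scopes
            actions[key] = "tombstone" if in_scope else "untouched"
    return actions
```

Passing `scanned_scopes` explicitly keeps the "missing from a different scope" rule visible in the code: a candidate is only tombstoned when its owning scope actually ran.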

Tombstones explain graph drift and prevent immediate re-creation loops. They
should be retained long enough to compare several scan cycles and can later be
compacted by repo, extractor, or entity kind.

## Mapping To Fabric Graphs

Discovery candidates can project into the existing graph model when accepted:

- candidate service nodes map to `ServiceDeclaration`-like graph nodes
- candidate capabilities and interfaces map to provider surface nodes
- candidate dependencies map to dependency nodes and `consumes` edges
- candidate deployment/runtime entities map to graph explorer infrastructure
  nodes until declarations gain first-class runtime support
- candidate libraries map to library inventory records and graph explorer nodes

If a repo-owned declaration already exists for the same entity, discovery output
should attach as supporting evidence instead of creating another node.
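
The projection rule, including the declaration check, could look like this. The entity-kind strings, target labels, `entity_id` field, and `project` helper are placeholders invented for the sketch; the actual graph model types are not specified in this document.

```python
# Placeholder kind -> target mapping mirroring the list above.
PROJECTION = {
    "service": "service_node",
    "capability": "provider_surface_node",
    "interface": "provider_surface_node",
    "dependency": "dependency_node",
    "deployment": "infrastructure_node",
    "runtime": "infrastructure_node",
    "library": "library_node",
}


def project(candidate: dict, declared_ids: set) -> tuple:
    """Decide how an accepted candidate enters the graph."""
    if candidate["entity_id"] in declared_ids:
        # A repo-owned declaration wins; discovery only adds evidence.
        return ("attach_evidence", candidate["entity_id"])
    return ("create_node", PROJECTION[candidate["entity_kind"]])
```

Checking for an existing declaration before consulting the mapping keeps declared entities from ever gaining a second node.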

## LLM Boundary

LLM extraction through `llm-connect` is optional and schema-gated. The scanner
should use deterministic preselection to build small evidence bundles, ask for
structured JSON, validate the JSON against the discovery schema, and record:

- extractor id and version
- prompt version
- provider and model
- usage metadata
- confidence and uncertainty
- rationale

Malformed, low-confidence, or conflicting LLM output becomes review material,
not accepted graph data.
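
The gate described above can be sketched with the standard library alone. Note the substitution: a required-field check stands in for real validation against the discovery schema, and conflict detection is left out. The field names, the 0.8 confidence threshold, and `gate_llm_output` are assumptions.

```python
import json

# Assumed provenance field names; the real ones come from the schema.
REQUIRED_PROVENANCE = (
    "extractor_id",
    "extractor_version",
    "prompt_version",
    "provider",
    "model",
    "usage",
    "confidence",
    "rationale",
)
MIN_CONFIDENCE = 0.8  # assumed threshold


def gate_llm_output(raw: str) -> tuple:
    """Return ('candidate' | 'review', reason) for one raw LLM response."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return ("review", "malformed JSON")
    if not isinstance(payload, dict):
        return ("review", "not a JSON object")
    missing = [name for name in REQUIRED_PROVENANCE if name not in payload]
    if missing:
        return ("review", "missing fields: " + ", ".join(missing))
    confidence = payload["confidence"]
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or confidence < MIN_CONFIDENCE:
        return ("review", "low confidence")
    return ("candidate", "passed checks")
```

Passing the gate only makes the output a candidate. Acceptance into the graph still goes through reconciliation and the merge precedence.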