Add discovery snapshot contract

2026-05-19 03:37:05 +02:00
parent e150270511
commit 1c0995004e
5 changed files with 1037 additions and 2 deletions

@@ -0,0 +1,165 @@
# Repo Reality Scanner
The repo reality scanner discovers Fabric entities from repository evidence and
turns them into candidate graph facts. It is a discovery layer, not a new
authoring surface. Repo-owned declarations remain the highest-trust source for
accepted Fabric graph data.
## Contract
A scanner run emits a `FabricDiscoverySnapshot`. The snapshot is scoped to one
repository, one commit, and one scan profile. It contains:
- replacement scopes, which define the evidence sets that may be replaced on a
rescan
- candidate nodes, edges, and attributes
- source anchors for every candidate
- extractor provenance for every candidate
- tombstones for candidates that vanished inside a replacement scope
- reconciliation policy metadata
The JSON Schema for the snapshot lives at `schemas/discovery-snapshot.schema.yaml`.
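To make the shape of the contract concrete, the following is a minimal sketch of a snapshot as Python dataclasses. Every class and field name other than `FabricDiscoverySnapshot` is an assumption made for illustration; the schema file remains the authoritative definition.

```python
from dataclasses import dataclass, field
from typing import Literal


@dataclass(frozen=True)
class SourceAnchor:
    kind: str          # hypothetical, e.g. "file" or "package_manifest"
    location: str      # hypothetical, e.g. a path or URL
    fingerprint: str


@dataclass(frozen=True)
class Provenance:
    extractor_id: str
    extractor_version: str


@dataclass
class Candidate:
    stable_key: str
    kind: Literal["node", "edge", "attribute"]
    scope_key: str
    anchors: list[SourceAnchor]
    provenance: Provenance

    def __post_init__(self) -> None:
        # The contract requires source anchors for every candidate.
        if not self.anchors:
            raise ValueError("a candidate needs at least one source anchor")


@dataclass(frozen=True)
class ReplacementScope:
    key: str
    mode: Literal["replacement", "additive"]


@dataclass(frozen=True)
class Tombstone:
    stable_key: str
    scope_key: str


@dataclass
class FabricDiscoverySnapshot:
    # One repository, one commit, one scan profile per snapshot.
    repo_slug: str
    commit: str
    scan_profile: str
    scopes: list[ReplacementScope] = field(default_factory=list)
    candidates: list[Candidate] = field(default_factory=list)
    tombstones: list[Tombstone] = field(default_factory=list)
    reconciliation_policy: dict[str, str] = field(default_factory=dict)
```

Rejecting anchor-less candidates at construction time is one possible way to enforce the "source anchors for every candidate" rule early, before schema validation runs.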
## Identity
Identity is the main safety boundary. The scanner must not append fresh guesses on
every run; it must produce stable keys that come out identical each time the same
entity is observed.
Candidate node keys use this shape:
```text
discovery:{repo_slug}:{entity_kind}:{normalized_name}[:source_fingerprint]
```
Use the optional source fingerprint when a name is too generic or when multiple
entities of the same kind can share a display name. Examples include HTTP
routes, generated clients, deployment manifests, and catalog records.
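A key builder for the node shape above might look like the sketch below. The function names and the exact normalization (ASCII folding, hyphen separators) are assumptions for illustration; the real rules are whatever `railiance_fabric.discovery` implements.

```python
import re
import unicodedata
from typing import Optional


def normalize_segment(value: str) -> str:
    """Lowercase, fold to ASCII, and collapse everything else to hyphens."""
    folded = unicodedata.normalize("NFKD", value)
    ascii_text = folded.encode("ascii", "ignore").decode("ascii")
    segment = re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")
    if not segment:
        raise ValueError(f"cannot derive a key segment from {value!r}")
    return segment


def node_key(
    repo_slug: str,
    entity_kind: str,
    name: str,
    source_fingerprint: Optional[str] = None,
) -> str:
    """Build discovery:{repo_slug}:{entity_kind}:{normalized_name}[:fingerprint]."""
    parts = [
        "discovery",
        normalize_segment(repo_slug),
        normalize_segment(entity_kind),
        normalize_segment(name),
    ]
    # Only generic or ambiguous names need the disambiguating fingerprint.
    if source_fingerprint:
        parts.append(source_fingerprint)
    return ":".join(parts)
```

Because the inputs are normalized before joining, `node_key("State-Hub", "Service", "Billing API")` and `node_key("state-hub", "service", "billing api")` yield the same key, which is the repeatability the contract asks for.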
Candidate edge keys use a relationship fingerprint over:
- source stable key
- edge type
- target stable key
- optional evidence scope
Candidate attribute keys use the entity stable key plus the normalized
attribute name and, where needed, a source fingerprint.
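Edge and attribute keys can be derived in the same spirit. The sketch below assumes SHA-256 truncated to 16 hex characters for the relationship fingerprint and an `:attr:` infix for attribute keys; both choices are hypothetical and not taken from the schema.

```python
import hashlib
from typing import Optional


def relationship_fingerprint(
    source_key: str,
    edge_type: str,
    target_key: str,
    evidence_scope: Optional[str] = None,
) -> str:
    """Hash the ordered edge parts so the same relationship always maps to one key."""
    parts = [source_key, edge_type, target_key]
    if evidence_scope is not None:
        parts.append(evidence_scope)
    # A unit-separator character keeps "a" + "bc" distinct from "ab" + "c".
    payload = "\x1f".join(parts)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def attribute_key(
    entity_key: str,
    attribute_name: str,
    source_fingerprint: Optional[str] = None,
) -> str:
    """Combine the entity key with a normalized attribute name."""
    name = "_".join(attribute_name.lower().split())
    key = f"{entity_key}:attr:{name}"
    if source_fingerprint:
        key = f"{key}:{source_fingerprint}"
    return key
```

The fingerprint is order-sensitive on purpose: `a consumes b` and `b consumes a` are different relationships and should not collide.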
Stable-key parts are lowercased and normalized to ASCII-like identity segments.
The helper functions in `railiance_fabric.discovery` define the initial rules.
## Source Anchors
Every candidate must carry one or more source anchors. A source anchor identifies
why the scanner believes the fact exists. Anchors can point to files, package
manifests, lockfiles, API contracts, deployment manifests, service catalogs,
registries, LLM evidence bundles, or manual review notes.
Source anchors include a fingerprint. The fingerprint should cover stable
location fields such as path, URL, ref, line range, or JSON pointer. Snippets are
useful for review but should not be the only identity anchor, because
formatting-only changes alter a snippet even when the underlying fact is unchanged.
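One way to keep snippets out of anchor identity is to hash only an allow-list of location fields. The field names and the use of SHA-256 over canonical JSON below are assumptions for illustration, not the scanner's actual implementation.

```python
import hashlib
import json

# Hypothetical names for the stable location fields listed above.
STABLE_FIELDS = ("path", "url", "ref", "line_range", "json_pointer")


def anchor_fingerprint(anchor: dict) -> str:
    """Fingerprint an anchor from its location fields only, ignoring snippets."""
    stable = {
        name: anchor[name]
        for name in STABLE_FIELDS
        if anchor.get(name) is not None
    }
    if not stable:
        raise ValueError("anchor has no stable location fields")
    # Sorted keys and fixed separators give a canonical byte representation.
    payload = json.dumps(stable, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two anchors that differ only in their `snippet` text then share a fingerprint, so reformatting a file does not create a new candidate.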
## Replacement Scopes
A replacement scope says which extractor owns which set of candidates. Rescans
may retire missing candidates only inside the same scope.
Examples:
- `scope:repo-scoping:python-package:package_manifest:<hash>`
- `scope:state-hub:fabric-declarations:declaration`
- `scope:llm-connect:readme-summary:file:<hash>`
- `scope:railiance-fabric:local-registry:fabric_registry`
Scopes have a mode:
- `replacement`: candidates missing from the next run in the same scope become
tombstones.
- `additive`: candidates are added or updated, but absence does not retire old
candidates.
LLM extractors should usually use replacement mode only for tightly bounded
evidence bundles. Broad repo summaries are safer as additive or review-only
until the extraction prompts are proven stable.
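The difference between the two modes reduces to how absence is interpreted. A minimal sketch, assuming candidates within one scope are represented as sets of stable keys (the function name is hypothetical):

```python
def retired_keys(previous: set[str], current: set[str], mode: str) -> set[str]:
    """Return the keys that should become tombstones for a single scope."""
    if mode == "replacement":
        # Absence is meaningful: anything not seen again is retired.
        return previous - current
    if mode == "additive":
        # Absence is not evidence: nothing is ever retired.
        return set()
    raise ValueError(f"unknown scope mode: {mode!r}")
```

This also shows why broad LLM summaries are risky in replacement mode: a run that happens to omit an entity would retire it, even though nothing in the repository changed.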
## Merge Precedence
When multiple sources describe the same entity, reconciliation uses this
precedence:
1. `repo_declaration`
2. `deterministic`
3. `catalog`
4. `registry`
5. `llm`
6. `manual`
Manual review can override local candidate state, but it should not silently
rewrite repo-owned declarations. If accepted discoveries should become
authoritative, the safer next step is to generate a repo-owned declaration patch
for human review.
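The precedence list can be applied as a simple rank lookup. In this sketch the `source_kind` field name and the dictionary representation of a candidate are assumptions; only the ordering comes from the list above.

```python
# Lower index means higher trust, following the documented order.
PRECEDENCE = [
    "repo_declaration",
    "deterministic",
    "catalog",
    "registry",
    "llm",
    "manual",
]
RANK = {name: index for index, name in enumerate(PRECEDENCE)}


def pick_authoritative(candidates: list[dict]) -> dict:
    """Choose the description of an entity that comes from the most trusted source."""
    if not candidates:
        raise ValueError("no candidates to reconcile")
    # min() keeps the first candidate when two share the same rank.
    return min(candidates, key=lambda c: RANK[c["source_kind"]])
```

The lower-ranked candidates are not discarded in this model; they would remain attached as supporting evidence for the winning description.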
## Duplicate Handling
Beyond exact stable-key matches, the reconciler should look for possible
duplicates using:
- alias overlap
- identical source anchors
- identical evidence fingerprints
- normalized label similarity within the same entity kind
- relationship fingerprints with the same endpoints and edge type
- declaration ids that match discovery aliases
Exact stable-key matches can be merged automatically. Alias-only or
similarity-only matches should become `needs_review` conflicts unless an
extractor has a source-specific rule that makes the match deterministic.
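The split between automatic merges and review conflicts could be sketched as follows. The `Candidate` fields, the signal names, and the 0.9 similarity threshold are all assumptions, and only three of the listed signals are shown.

```python
import difflib
from dataclasses import dataclass, field


@dataclass
class Candidate:
    stable_key: str
    entity_kind: str
    label: str
    aliases: set[str] = field(default_factory=set)
    anchor_fingerprints: set[str] = field(default_factory=set)


def duplicate_signals(a: Candidate, b: Candidate) -> list[str]:
    """Collect the weaker hints that two candidates may be the same entity."""
    signals = []
    if a.aliases & b.aliases:
        signals.append("alias_overlap")
    if a.anchor_fingerprints & b.anchor_fingerprints:
        signals.append("identical_source_anchor")
    if a.entity_kind == b.entity_kind:
        ratio = difflib.SequenceMatcher(None, a.label.lower(), b.label.lower()).ratio()
        if ratio >= 0.9:  # assumed threshold
            signals.append("label_similarity")
    return signals


def classify_pair(a: Candidate, b: Candidate) -> str:
    """Return 'merge', 'needs_review', or 'distinct' for a candidate pair."""
    if a.stable_key == b.stable_key:
        return "merge"
    return "needs_review" if duplicate_signals(a, b) else "distinct"
```

A source-specific deterministic rule, as mentioned above, would slot in before the `needs_review` fallback and promote selected signal combinations to `merge`.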
## Rescan And Tombstones
On a rescan, the scanner compares the previously accepted discovery snapshot with
the newly produced snapshot for the same repository and scan profile.
- Same stable key: update in place.
- Same source anchor but changed attributes: update with changed evidence.
- Missing from same replacement scope: create a tombstone.
- Missing from a different scope: leave untouched.
- Reappears after tombstone: reactivate if the stable key and scope match.
- Reappears with a new key but same alias/source anchor: flag as possible
duplicate resurrection.
Tombstones explain graph drift and prevent immediate re-creation loops. They
should be retained long enough to compare several scan cycles and can later be
compacted by repo, extractor, or entity kind.
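The key-based rescan rules can be expressed as a small planning function. This is a sketch under simplifying assumptions: each snapshot is reduced to a mapping from stable key to scope key, the function and action names are hypothetical, and the alias-based "duplicate resurrection" check is left out.

```python
def plan_rescan(
    previous: dict[str, str],
    tombstones: dict[str, str],
    current: dict[str, str],
    rescanned_replacement_scopes: set[str],
) -> dict[str, str]:
    """Map each stable key to the action a rescan should take.

    previous, tombstones, and current map stable key -> scope key.
    """
    actions: dict[str, str] = {}
    for key, scope in current.items():
        if key in previous:
            actions[key] = "update"
        elif tombstones.get(key) == scope:
            # Same key and same scope as an earlier tombstone.
            actions[key] = "reactivate"
        else:
            actions[key] = "create"
    for key, scope in previous.items():
        if key in current:
            continue
        if scope in rescanned_replacement_scopes:
            actions[key] = "tombstone"
        else:
            # Owned by a scope this run did not replace: leave untouched.
            actions[key] = "keep"
    return actions
```

Keeping the plan separate from its application makes it possible to review the proposed tombstones before any graph state changes.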
## Mapping To Fabric Graphs
Discovery candidates can project into the existing graph model when accepted:
- candidate service nodes map to `ServiceDeclaration`-like graph nodes
- candidate capabilities and interfaces map to provider surface nodes
- candidate dependencies map to dependency nodes and `consumes` edges
- candidate deployment/runtime entities map to graph explorer infrastructure
nodes until declarations gain first-class runtime support
- candidate libraries map to library inventory records and graph explorer nodes
If a repo-owned declaration already exists for the same entity, discovery output
should attach as supporting evidence instead of creating another node.
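The projection rule, including the "attach instead of duplicate" case, might be sketched like this. The target names in the table and the candidate field names are placeholders invented for illustration; they are not identifiers from the Fabric graph model.

```python
# Hypothetical target names standing in for the graph model's node types.
PROJECTION_TARGETS = {
    "service": "service_node",
    "capability": "provider_surface_node",
    "interface": "provider_surface_node",
    "dependency": "dependency_node",
    "deployment": "infrastructure_node",
    "runtime": "infrastructure_node",
    "library": "library_node",
}


def project(candidate: dict, declared_ids: set[str]) -> tuple[str, str]:
    """Decide whether an accepted candidate creates a node or supports a declaration."""
    identities = {candidate["stable_key"], *candidate.get("aliases", ())}
    matches = sorted(identities & declared_ids)
    if matches:
        # A repo-owned declaration already covers this entity.
        return ("attach_evidence", matches[0])
    return ("create_node", PROJECTION_TARGETS[candidate["entity_kind"]])
```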
## LLM Boundary
LLM extraction through `llm-connect` is optional and schema-gated. The scanner
should use deterministic preselection to build small evidence bundles, ask for
structured JSON, validate the JSON against the discovery schema, and record:
- extractor id and version
- prompt version
- provider and model
- usage metadata
- confidence and uncertainty
- rationale
Malformed, low-confidence, or conflicting LLM output becomes review material,
not accepted graph data.
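A gate for LLM output could look like the sketch below. The required field names, the 0.8 confidence threshold, and the return values are assumptions, and the required-field check is only a stand-in for real validation against the discovery schema. Conflict detection is out of scope here.

```python
import json

# Hypothetical field names mirroring the metadata listed above.
REQUIRED_FIELDS = {
    "extractor_id",
    "extractor_version",
    "prompt_version",
    "provider",
    "model",
    "confidence",
    "candidates",
}
MIN_CONFIDENCE = 0.8  # assumed threshold


def gate_llm_output(raw: str) -> tuple[str, str]:
    """Return (status, reason); anything not 'eligible' goes to human review."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return ("needs_review", "malformed_json")
    if not isinstance(payload, dict):
        return ("needs_review", "malformed_json")
    if REQUIRED_FIELDS - payload.keys():
        return ("needs_review", "missing_fields")
    confidence = payload["confidence"]
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or confidence < MIN_CONFIDENCE:
        return ("needs_review", "low_confidence")
    return ("eligible", "ok")
```

Note that `eligible` only means the output may enter reconciliation as candidates; it does not make the output accepted graph data.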