Add discovery snapshot contract

2026-05-19 03:37:05 +02:00
parent e150270511
commit 1c0995004e
5 changed files with 1037 additions and 2 deletions
--- a/docs/repo-reality-scanner.md
+++ b/docs/repo-reality-scanner.md
@@ -0,0 +1,165 @@
+# Repo Reality Scanner
+
+The repo reality scanner discovers Fabric entities from repository evidence and
+turns them into candidate graph facts. It is a discovery layer, not a new
+authoring surface. Repo-owned declarations remain the highest-trust source for
+accepted Fabric graph data.
+
+## Contract
+
+A scanner run emits a `FabricDiscoverySnapshot`. The snapshot is scoped to one
+repository, one commit, and one scan profile. It contains:
+
+- replacement scopes, which define the evidence sets that may be replaced on a
+  rescan
+- candidate nodes, edges, and attributes
+- source anchors for every candidate
+- extractor provenance for every candidate
+- tombstones for candidates that vanished inside a replacement scope
+- reconciliation policy metadata
+
+The JSON Schema for the snapshot lives at
+`schemas/discovery-snapshot.schema.yaml`.
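As a rough illustration of the snapshot shape listed above, here is a minimal Python sketch. The class and field names other than `FabricDiscoverySnapshot` are assumptions made for this example; the schema file remains the authoritative definition.

```python
from dataclasses import dataclass, field


@dataclass
class SourceAnchor:
    kind: str         # hypothetical field, e.g. "file" or "package_manifest"
    location: str     # path, URL, or JSON pointer
    fingerprint: str


@dataclass
class Candidate:
    stable_key: str
    kind: str                    # "node", "edge", or "attribute"
    scope_id: str                # replacement scope that owns this candidate
    anchors: list[SourceAnchor]
    extractor: dict[str, str]    # provenance, e.g. extractor id and version


@dataclass
class FabricDiscoverySnapshot:
    # A snapshot is scoped to one repository, one commit, one scan profile.
    repo_slug: str
    commit: str
    scan_profile: str
    replacement_scopes: list[dict] = field(default_factory=list)
    candidates: list[Candidate] = field(default_factory=list)
    tombstones: list[dict] = field(default_factory=list)
    reconciliation_policy: dict = field(default_factory=dict)

    def validate(self) -> list[str]:
        """Return problems found; an empty list means these checks passed."""
        problems = []
        for candidate in self.candidates:
            if not candidate.anchors:
                problems.append(f"{candidate.stable_key}: missing source anchor")
            if not candidate.extractor:
                problems.append(f"{candidate.stable_key}: missing extractor provenance")
        return problems
```

The `validate` method only checks the two "for every candidate" requirements from the list; full validation would go through the schema.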
+
+## Identity
+
+Identity is the main safety boundary. The scanner must not append fresh guesses
+on every run; it must produce stable keys that repeat across runs for the same
+observed entity.
+
+Candidate node keys use this shape:
+
+```text
+discovery:{repo_slug}:{entity_kind}:{normalized_name}[:source_fingerprint]
+```
+
+Use the optional source fingerprint when a name is too generic or when multiple
+entities of the same kind can share a display name. Examples include HTTP
+routes, generated clients, deployment manifests, and catalog records.
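A minimal sketch of a key builder for the node-key shape above. The function names and the exact normalization (ASCII folding, lowercasing, collapsing other characters to `-`) are assumptions for illustration and may differ from the actual helpers.

```python
import re
import unicodedata


def normalize_segment(value: str) -> str:
    # Assumed rules: fold to ASCII, lowercase, collapse everything else to "-".
    folded = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", folded.lower()).strip("-")


def candidate_node_key(repo_slug, entity_kind, name, source_fingerprint=None):
    parts = [
        "discovery",
        normalize_segment(repo_slug),
        normalize_segment(entity_kind),
        normalize_segment(name),
    ]
    # The fingerprint is appended only when the name alone is ambiguous.
    if source_fingerprint:
        parts.append(source_fingerprint)
    return ":".join(parts)
```

Under these assumptions, `candidate_node_key("State-Hub", "service", "Billing API")` yields `discovery:state-hub:service:billing-api`, and the same inputs always yield the same key.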
+
+Candidate edge keys use a relationship fingerprint over:
+
+- source stable key
+- edge type
+- target stable key
+- optional evidence scope
+
+Candidate attribute keys use the entity stable key plus the normalized
+attribute name and, where needed, a source fingerprint.
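One possible way to derive edge and attribute keys from these parts, sketched in Python. The hash choice (SHA-256 truncated to 16 hex characters), the JSON encoding of the parts, and the `attr` segment are all assumptions for this example, not the project's rules.

```python
import hashlib
import json


def relationship_fingerprint(source_key, edge_type, target_key, evidence_scope=None):
    # Encode the parts as a JSON array so that separators inside the keys
    # cannot make two different relationships hash to the same input.
    payload = json.dumps(
        [source_key, edge_type, target_key, evidence_scope], separators=(",", ":")
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def candidate_attribute_key(entity_key, attribute_name, source_fingerprint=None):
    # Hypothetical layout: entity key, an "attr" marker, the normalized name.
    name = attribute_name.strip().lower().replace(" ", "_")
    key = f"{entity_key}:attr:{name}"
    return f"{key}:{source_fingerprint}" if source_fingerprint else key
```

Because the fingerprint covers source, type, and target in order, reversing an edge's direction produces a different key.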
+
+Stable-key parts are lowercased and normalized to ASCII-like identity segments.
+The helper functions in `railiance_fabric.discovery` define the initial rules.
+
+## Source Anchors
+
+Every candidate must carry one or more source anchors. A source anchor records
+the evidence that led the scanner to believe the fact exists. Anchors can point
+to files, package manifests, lockfiles, API contracts, deployment manifests,
+service catalogs, registries, LLM evidence bundles, or manual review notes.
+
+Source anchors include a fingerprint. The fingerprint should cover stable
+location fields such as path, URL, ref, line range, or JSON pointer. Snippets are
+useful for review but should not be the only identity anchor because formatting
+noise can churn snippets.
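The following sketch shows one way to fingerprint an anchor from its location fields only, so that snippet changes do not alter identity. The dictionary field names and the use of SHA-256 are assumptions for illustration.

```python
import hashlib
import json

# Hypothetical names for the stable location fields listed above.
LOCATION_FIELDS = ("path", "url", "ref", "line_range", "json_pointer")


def anchor_fingerprint(anchor: dict) -> str:
    # Keep only location fields; the snippet is deliberately left out.
    located = {k: anchor[k] for k in LOCATION_FIELDS if anchor.get(k) is not None}
    if not located:
        raise ValueError("anchor has no stable location field")
    payload = json.dumps(located, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two anchors that share a path and line range but differ in snippet whitespace get the same fingerprint, while an anchor that carries only a snippet is rejected.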
+
+## Replacement Scopes
+
+A replacement scope says which extractor owns which set of candidates. Rescans
+may retire missing candidates only inside the same scope.
+
+Examples:
+
+- `scope:repo-scoping:python-package:package_manifest:<hash>`
+- `scope:state-hub:fabric-declarations:declaration`
+- `scope:llm-connect:readme-summary:file:<hash>`
+- `scope:railiance-fabric:local-registry:fabric_registry`
+
+Scopes have a mode:
+
+- `replacement`: candidates missing from the next run in the same scope become
+  tombstones.
+- `additive`: candidates are added or updated, but absence does not retire old
+  candidates.
+
+LLM extractors should generally reserve replacement mode for tightly bounded
+evidence bundles. Broad repo summaries are safer as additive or review-only
+until the extraction prompts are proven stable.
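The difference between the two modes can be sketched as a small function over the stable keys seen in one scope. The function name is hypothetical.

```python
def retired_keys(previous: set[str], current: set[str], mode: str) -> set[str]:
    """Return the keys that should become tombstones for one scope."""
    if mode == "replacement":
        # Anything the scope produced before but not now is retired.
        return previous - current
    if mode == "additive":
        # Absence carries no meaning in additive mode.
        return set()
    raise ValueError(f"unknown scope mode: {mode}")
```

This also shows why a broad LLM summary is risky in replacement mode: any candidate the model fails to mention on the next run would be retired.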
+
+## Merge Precedence
+
+When multiple sources describe the same entity, reconciliation uses this
+precedence, highest first:
+
+1. `repo_declaration`
+2. `deterministic`
+3. `catalog`
+4. `registry`
+5. `llm`
+6. `manual`
+
+Manual review can override local candidate state, but it should not silently
+rewrite repo-owned declarations. If accepted discoveries should become
+authoritative, the safer next step is to generate a repo-owned declaration patch
+for human review.
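A minimal sketch of precedence-based selection, assuming the list above is ordered from highest to lowest precedence and that each claim carries a `source` field (a hypothetical name). Manual overrides of local candidate state are a separate concern and are not modeled here.

```python
PRECEDENCE = ["repo_declaration", "deterministic", "catalog", "registry", "llm", "manual"]
RANK = {name: index for index, name in enumerate(PRECEDENCE)}


def pick_winner(claims: list[dict]) -> dict:
    """Return the claim whose source ranks earliest in the precedence list."""
    # min() keeps the first claim on ties, so input order breaks ties.
    return min(claims, key=lambda claim: RANK[claim["source"]])
```

For example, a `catalog` claim wins over an `llm` claim about the same entity, and a `repo_declaration` claim wins over both.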
+
+## Duplicate Handling
+
+The reconciler should merge candidates with the same stable key automatically.
+It should also look for possible duplicates using:
+
+- alias overlap
+- identical source anchors
+- identical evidence fingerprints
+- normalized label similarity within the same entity kind
+- relationship fingerprints with the same endpoints and edge type
+- declaration ids that match discovery aliases
+
+Exact stable-key matches can be merged automatically. Alias-only or
+similarity-only matches should become `needs_review` conflicts unless an
+extractor has a source-specific rule that makes the match deterministic.
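The merge-or-review decision could look roughly like this. The dictionary fields (`aliases`, `anchor_fingerprints`) and the `distinct` outcome are assumptions, and only two of the listed signals are checked to keep the sketch short.

```python
def classify_match(a: dict, b: dict, deterministic_rule=None) -> str:
    """Return "merge", "needs_review", or "distinct" for two candidates."""
    if a["stable_key"] == b["stable_key"]:
        return "merge"
    alias_overlap = bool(set(a.get("aliases", ())) & set(b.get("aliases", ())))
    shared_anchor = bool(
        set(a.get("anchor_fingerprints", ())) & set(b.get("anchor_fingerprints", ()))
    )
    if not (alias_overlap or shared_anchor):
        return "distinct"
    # A source-specific rule may promote a weak signal to a deterministic match.
    if deterministic_rule is not None and deterministic_rule(a, b):
        return "merge"
    return "needs_review"
```

Treating a shared anchor like an alias overlap (review unless a rule says otherwise) is a conservative choice made for this sketch.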
+
+## Rescan And Tombstones
+
+On a rescan, the scanner compares the previous accepted discovery snapshot with
+the newly produced snapshot for the same repository and scan profile.
+
+- Same stable key: update in place.
+- Same source anchor but changed attributes: update with changed evidence.
+- Missing from same replacement scope: create a tombstone.
+- Missing from a different scope: leave untouched.
+- Reappears after tombstone: reactivate if the stable key and scope match.
+- Reappears with a new key but same alias/source anchor: flag as possible
+  duplicate resurrection.
+
+Tombstones explain graph drift and prevent immediate re-creation loops. They
+should be retained long enough to compare several scan cycles and can later be
+compacted by repo, extractor, or entity kind.
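The key-based rules above can be sketched as a planning function. The names and the `create` action are hypothetical, and the alias or source-anchor check for duplicate resurrection is left out for brevity.

```python
def plan_rescan(previous, current, tombstoned, replacement_scopes):
    """Plan one action per stable key.

    previous and current map stable key -> scope id for active candidates;
    tombstoned maps stable key -> scope id for existing tombstones;
    replacement_scopes is the set of scope ids rescanned in replacement mode.
    """
    actions = {}
    for key, scope in current.items():
        if key in previous:
            actions[key] = "update"
        elif tombstoned.get(key) == scope:
            # Same key and same scope as an earlier tombstone.
            actions[key] = "reactivate"
        else:
            actions[key] = "create"
    for key, scope in previous.items():
        if key not in current:
            # Only the owning replacement scope may retire a candidate.
            actions[key] = "tombstone" if scope in replacement_scopes else "keep"
    return actions
```

A candidate owned by a scope that was not rescanned stays untouched, which keeps one extractor from retiring another extractor's findings.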
+
+## Mapping To Fabric Graphs
+
+Discovery candidates can project into the existing graph model when accepted:
+
+- candidate service nodes map to `ServiceDeclaration`-like graph nodes
+- candidate capabilities and interfaces map to provider surface nodes
+- candidate dependencies map to dependency nodes and `consumes` edges
+- candidate deployment/runtime entities map to graph explorer infrastructure
+  nodes until declarations gain first-class runtime support
+- candidate libraries map to library inventory records and graph explorer nodes
+
+If a repo-owned declaration already exists for the same entity, discovery output
+should attach as supporting evidence instead of creating another node.
+
+## LLM Boundary
+
+LLM extraction through `llm-connect` is optional and schema-gated. The scanner
+should use deterministic preselection to build small evidence bundles, ask for
+structured JSON, validate the JSON against the discovery schema, and record:
+
+- extractor id and version
+- prompt version
+- provider and model
+- usage metadata
+- confidence and uncertainty
+- rationale
+
+Malformed, low-confidence, or conflicting LLM output becomes review material,
+not accepted graph data.
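A minimal sketch of the gate described above, using only the standard library. The required field names, the `0.8` threshold, and the status strings are assumptions; a real gate would validate against the discovery schema rather than this stand-in field check.

```python
import json

# Hypothetical field names for the metadata the scanner should record.
REQUIRED_FIELDS = ("extractor", "prompt_version", "provider", "model", "confidence", "rationale")


def gate_llm_output(raw: str, min_confidence: float = 0.8) -> tuple[str, str]:
    """Return (status, reason); anything suspicious goes to review."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return "needs_review", "malformed JSON"
    if not isinstance(record, dict):
        return "needs_review", "not a JSON object"
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        return "needs_review", "missing fields: " + ", ".join(missing)
    confidence = record["confidence"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return "needs_review", "confidence is not a number"
    if confidence < min_confidence:
        return "needs_review", "low confidence"
    return "candidate", "passed gate"
```

Output that passes the gate is still only a candidate; conflicts with other sources are handled later by reconciliation.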