Repo Reality Scanner
The repo reality scanner discovers Fabric entities from repository evidence and turns them into candidate graph facts. It is a discovery layer, not a new authoring surface. Repo-owned declarations remain the highest-trust source for accepted Fabric graph data.
Contract
A scanner run emits a FabricDiscoverySnapshot. The snapshot is scoped to one
repository, one commit, and one scan profile. It contains:
- replacement scopes, which define the evidence sets that may be replaced on a rescan
- candidate nodes, edges, and attributes
- source anchors for every candidate
- extractor provenance for every candidate
- tombstones for candidates that vanished inside a replacement scope
- reconciliation policy metadata
The JSON schema lives at schemas/discovery-snapshot.schema.yaml.
Deterministic Scanner CLI
The first implementation slice adds an offline deterministic scan command:
railiance-fabric scan . \
--repo-slug railiance-fabric \
--commit "$(git rev-parse HEAD)" \
--dry-run \
--output discovery-snapshot.json
Use --json to print the full FabricDiscoverySnapshot to stdout. Without
--json, the command prints a concise summary of node, edge, attribute, and
replacement-scope counts. The scanner does not call registries, catalogs, or
LLMs in this mode; --output is the only write side effect.
The deterministic extractor framework currently covers:
- repository metadata from local git/path evidence
- README, INTENT, and SCOPE document presence and headings
- repo-owned Fabric declarations under fabric/
- Python pyproject.toml package metadata and dependencies
- Node package.json package metadata and dependencies
- common lockfiles such as package-lock.json, poetry.lock, and uv.lock
- Dockerfiles and Docker Compose services
- OpenAPI and AsyncAPI contract files
- Score workload files
- Kubernetes-style deployment manifests
- common service config files such as application.yaml and appsettings.json
Each extractor emits candidates through the same accumulator so stable-key duplicates merge inside a scan before the snapshot is returned.
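The merge-on-add behaviour can be pictured with a minimal sketch. The `Candidate` and `CandidateAccumulator` names and their fields are hypothetical and chosen for illustration; the scanner's real accumulator may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    # Hypothetical candidate shape, reduced to the fields needed for merging.
    stable_key: str
    kind: str
    anchors: list = field(default_factory=list)
    provenance: list = field(default_factory=list)


class CandidateAccumulator:
    """Collects candidates from every extractor and merges stable-key duplicates."""

    def __init__(self):
        self._by_key = {}

    def add(self, candidate):
        existing = self._by_key.get(candidate.stable_key)
        if existing is None:
            # Store a copy so later merges never mutate the extractor's object.
            self._by_key[candidate.stable_key] = Candidate(
                candidate.stable_key,
                candidate.kind,
                list(candidate.anchors),
                list(candidate.provenance),
            )
            return
        # Same stable key: union anchors and provenance, keeping first-seen order.
        for anchor in candidate.anchors:
            if anchor not in existing.anchors:
                existing.anchors.append(anchor)
        for entry in candidate.provenance:
            if entry not in existing.provenance:
                existing.provenance.append(entry)

    def candidates(self):
        # Sort by key so the resulting snapshot is deterministic across runs.
        return [self._by_key[key] for key in sorted(self._by_key)]
```

With this shape, a library seen in both pyproject.toml and uv.lock ends up as one candidate carrying two anchors rather than two competing candidates.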
LLM-Assisted Extraction
LLM extraction is optional and explicit:
railiance-fabric scan . \
--repo-slug railiance-fabric \
--llm \
--llm-provider openai \
--llm-model gpt-4.1-mini \
--dry-run \
--output discovery-with-llm.json
The implementation integrates through llm-connect with create_adapter and
RunConfig. Tests use a MockLLMAdapter-compatible boundary so CI stays
offline. If llm-connect is unavailable, the provider call fails, or the model
returns malformed JSON, the scanner records a review_artifacts entry and keeps
the discovery snapshot schema-valid.
The LLM never receives the whole repository. The scanner first builds a compact evidence bundle from deterministic candidates, prioritizing repo-owned Fabric declarations, services, capabilities, interfaces, libraries, deployments, and small README/INTENT/SCOPE signals. The prompt asks for strict JSON:
{"nodes": [], "edges": [], "attributes": []}
Projected LLM candidates are always origin: llm and
review_state: needs_review. Candidates below the configured confidence
threshold become llm_low_confidence review artifacts instead of graph
candidates. Unresolved edge endpoints or attribute targets also become review
artifacts. Accepted graph data still requires deterministic evidence,
repo-owned declarations, or a later human review/acceptance path.
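The projection rules above could be sketched roughly as follows. The function name, the item fields (`stable_key`, `source`, `target`, `confidence`), the default threshold, and the `llm_unresolved_endpoint` artifact kind are assumptions made for this example; only `origin: llm`, `review_state: needs_review`, and `llm_low_confidence` come from the description above.

```python
def project_llm_output(payload, known_node_keys, threshold=0.6):
    """Split parsed LLM output into graph candidates and review artifacts."""
    candidates, review_artifacts = [], []
    resolvable = set(known_node_keys)

    def low_confidence(item):
        return float(item.get("confidence", 0.0)) < threshold

    for node in payload.get("nodes", []):
        if low_confidence(node):
            review_artifacts.append({"kind": "llm_low_confidence", "item": node})
            continue
        # LLM candidates are never accepted directly; they always need review.
        candidates.append({**node, "origin": "llm", "review_state": "needs_review"})
        resolvable.add(node["stable_key"])

    for edge in payload.get("edges", []):
        if low_confidence(edge):
            review_artifacts.append({"kind": "llm_low_confidence", "item": edge})
        elif edge["source"] not in resolvable or edge["target"] not in resolvable:
            # Hypothetical artifact kind for edges whose endpoints cannot be resolved.
            review_artifacts.append({"kind": "llm_unresolved_endpoint", "item": edge})
        else:
            candidates.append({**edge, "origin": "llm", "review_state": "needs_review"})

    return candidates, review_artifacts
```

Attribute handling is left out for brevity; it would follow the same pattern as edges, checking the attribute target instead of two endpoints.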
Reconciliation And Dry-Run Diffs
Scans can be reconciled against a previous discovery snapshot:
railiance-fabric scan . \
--repo-slug railiance-fabric \
--previous-snapshot previous-discovery.json \
--dry-run \
--output current-discovery.json
The reconciler writes reconciliation.diff with explicit stable-key sets:
- added
- changed
- retired
- conflicted
It deduplicates candidates by stable key, merges source anchors and provenance, and applies source-aware precedence when duplicate candidates disagree. The current precedence is:
1. repo_declaration
2. deterministic
3. catalog
4. registry
5. llm
6. manual
Possible duplicates found through matching aliases, normalized labels,
relationship endpoints, or attribute targets are not silently merged. They are
marked status: conflicted, moved to review_state: needs_review, and listed
under reconciliation.conflicts.
Missing previous candidates become tombstones only when their replacement scope
is present in the current scan and has mode: replacement. Missing candidates
from additive scopes, such as broad LLM evidence bundles, are left alone.
Existing tombstones are preserved so repeated scans can explain graph drift.
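A minimal sketch of how the added, changed, and retired sets might be derived, assuming each snapshot is reduced to a mapping from stable key to a small record. The `scope` and `fingerprint` field names and the function name are hypothetical. The conflicted set is omitted because it depends on duplicate detection rather than on key comparison.

```python
def diff_snapshots(previous, current, current_scope_modes):
    """Compute stable-key sets for a dry-run diff.

    previous and current map stable key -> {"scope": ..., "fingerprint": ...}.
    current_scope_modes maps scope id -> "replacement" or "additive" for the
    scopes that are present in the current scan.
    """
    added = sorted(key for key in current if key not in previous)
    changed = sorted(
        key
        for key in current
        if key in previous
        and current[key]["fingerprint"] != previous[key]["fingerprint"]
    )
    # A missing candidate is retired only if its scope was rescanned in
    # replacement mode; additive or absent scopes leave it alone.
    retired = sorted(
        key
        for key, candidate in previous.items()
        if key not in current
        and current_scope_modes.get(candidate["scope"]) == "replacement"
    )
    return {"added": added, "changed": changed, "retired": retired}
```

Looking up the scope with `.get` means a scope that did not run at all in the current scan never retires anything, which matches the rule that the scope must be present.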
Connector Follow-Up
Connector follow-up is explicit and separated from repo-local extraction:
railiance-fabric scan . \
--repo-slug railiance-fabric \
--connector local-fabric-registry \
--connector-manifest registry/local-repos.yaml \
--dry-run
The connector interface has slots for:
- package registries
- container registries
- API catalogs
- service catalogs
- deployment inventories
- existing Fabric registry data
The first implementation is local-fabric-registry, an offline-safe connector
that reads a local onboarding manifest such as registry/local-repos.yaml. It
adds a FabricRegistryEntry candidate, a cataloged_as edge from the
repository node, and registry-sourced attributes such as domain, remote URL,
default branch, State Hub repo id, and declaration paths.
Connector evidence uses its own replacement scope with source kind
fabric_registry, so rescans can replace catalog facts without retiring
repo-local evidence. Connector run metadata is recorded under connector_runs
with status, source, message, and candidate counts.
Connector-derived facts should be treated this way:
- accepted: only when the connector reads explicit repo-owned declarations or a catalog already governed as authoritative for that field
- candidate: stable local registry facts such as onboarding manifest entries, declared remote URLs, State Hub ids, and declaration paths
- review-only: missing catalogs, rate limits, connector failures, ambiguous matches, or facts from catalogs with unclear ownership
Failures do not corrupt the scan. Missing catalogs become
connector_unavailable review artifacts, malformed catalogs become
connector_failed artifacts, and future remote connectors should use
connector_rate_limited when backoff is required.
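The failure handling can be illustrated with a small sketch. Everything except the `connector_unavailable` and `connector_failed` artifact kinds is an assumption: the function name, the status strings, the `repos` manifest key, and the injected `parse` callable (standing in for whatever YAML loader the project uses).

```python
from pathlib import Path


def run_manifest_connector(manifest_path, parse):
    """Read a local manifest without raising; failures become review artifacts.

    parse is any callable that turns manifest text into a dict.
    Returns (run_metadata, entries, review_artifacts).
    """
    run = {"source": str(manifest_path), "status": "ok", "message": "",
           "candidate_count": 0}
    artifacts = []
    path = Path(manifest_path)
    if not path.is_file():
        # Missing catalog: note it and let the repo-local scan continue.
        run.update(status="unavailable", message="manifest not found")
        artifacts.append({"kind": "connector_unavailable", "source": run["source"]})
        return run, [], artifacts
    try:
        data = parse(path.read_text(encoding="utf-8"))
        entries = list(data["repos"])  # hypothetical manifest layout
    except Exception as exc:
        # Malformed catalog: record the failure instead of aborting the scan.
        run.update(status="failed", message=str(exc))
        artifacts.append({"kind": "connector_failed", "source": run["source"]})
        return run, [], artifacts
    run["candidate_count"] = len(entries)
    return run, entries, artifacts
```

Catching a broad exception is deliberate in this sketch: a connector problem of any kind should degrade to a review artifact rather than corrupt or abort the scan.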
Identity
Identity is the main safety boundary. The scanner must not append guesses on every run. It needs to produce stable keys that are repeatable for the same observed entity.
Candidate node keys use this shape:
discovery:{repo_slug}:{entity_kind}:{normalized_name}[:source_fingerprint]
Use the optional source fingerprint when a name is too generic or when multiple entities of the same kind can share a display name. Examples include HTTP routes, generated clients, deployment manifests, and catalog records.
Candidate edge keys use a relationship fingerprint over:
- source stable key
- edge type
- target stable key
- optional evidence scope
Candidate attribute keys use the entity stable key plus the normalized attribute name and, where needed, a source fingerprint.
Stable-key parts are lowercased and normalized to ASCII-like identity segments.
The helper functions in railiance_fabric.discovery define the initial rules.
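As an illustration only, key construction along these lines might look like the sketch below. These are not the helpers from railiance_fabric.discovery; the function names, the exact normalization rules, the `unknown` fallback, and the 16-character digest length are assumptions.

```python
import hashlib
import re
import unicodedata


def normalize_segment(value):
    """Lowercase a key part and reduce it to an ASCII slug."""
    decomposed = unicodedata.normalize("NFKD", value)
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")
    return slug or "unknown"


def node_key(repo_slug, entity_kind, name, source_fingerprint=None):
    """Build discovery:{repo_slug}:{entity_kind}:{normalized_name}[:fingerprint]."""
    parts = [
        "discovery",
        normalize_segment(repo_slug),
        normalize_segment(entity_kind),
        normalize_segment(name),
    ]
    if source_fingerprint:
        parts.append(source_fingerprint)
    return ":".join(parts)


def edge_key(source_key, edge_type, target_key, evidence_scope=None):
    """Fingerprint a relationship over its endpoints, type, and optional scope."""
    # A separator that cannot occur in keys keeps distinct inputs distinct.
    payload = "\x1f".join([source_key, edge_type, target_key, evidence_scope or ""])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

Because both functions depend only on their inputs, the same observed entity yields the same key on every run, which is the property the identity rules rely on.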
Source Anchors
Every candidate must carry one or more source anchors. A source anchor identifies why the scanner believes the fact exists. Anchors can point to files, package manifests, lockfiles, API contracts, deployment manifests, service catalogs, registries, LLM evidence bundles, or manual review notes.
Source anchors include a fingerprint. The fingerprint should cover stable location fields such as path, URL, ref, line range, or JSON pointer. Snippets are useful for review but should not be the only identity anchor because formatting noise can churn snippets.
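One way to keep snippets out of the identity is to hash only the location fields. The sketch below assumes hypothetical anchor field names (`path`, `url`, `ref`, `line_start`, `line_end`, `json_pointer`, `snippet`); the real anchor schema may name them differently.

```python
import hashlib
import json

# Location fields that take part in identity; snippets are deliberately excluded.
STABLE_FIELDS = ("path", "url", "ref", "line_start", "line_end", "json_pointer")


def anchor_fingerprint(anchor):
    """Hash only the stable location fields of a source anchor."""
    stable = {
        name: anchor[name]
        for name in STABLE_FIELDS
        if anchor.get(name) is not None
    }
    # Canonical JSON (sorted keys, fixed separators) keeps the hash repeatable.
    canonical = json.dumps(stable, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Reformatting the quoted snippet leaves the fingerprint unchanged, while moving the anchor to a different line range produces a new one.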
Replacement Scopes
A replacement scope says which extractor owns which set of candidates. Rescans may retire missing candidates only inside the same scope.
Examples:
- scope:repo-scoping:python-package:package_manifest:<hash>
- scope:state-hub:fabric-declarations:declaration
- scope:llm-connect:readme-summary:file:<hash>
- scope:railiance-fabric:local-registry:fabric_registry
Scopes have a mode:
- replacement: candidates missing from the next run in the same scope become tombstones.
- additive: candidates are added or updated, but absence does not retire old candidates.
LLM extractors should use replacement mode only for tightly bounded evidence bundles. Broad repo summaries are safer as additive or review-only until the extraction prompts are proven stable.
Merge Precedence
When multiple sources describe the same entity, reconciliation uses this precedence:
1. repo_declaration
2. deterministic
3. catalog
4. registry
5. llm
6. manual
Manual review can override local candidate state, but it should not silently rewrite repo-owned declarations. If accepted discoveries should become authoritative, the safer next step is to generate a repo-owned declaration patch for human review.
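A precedence lookup can be sketched in a few lines. The sketch assumes that earlier entries in the list win, which this document implies for repo-owned declarations but does not spell out for every pair; the function name and the `origin` field are likewise illustrative.

```python
# Order copied from the precedence list; treating earlier entries as higher
# precedence is an assumption of this sketch.
PRECEDENCE = ("repo_declaration", "deterministic", "catalog", "registry", "llm", "manual")
_RANK = {origin: index for index, origin in enumerate(PRECEDENCE)}


def resolve_disagreement(candidates):
    """Return (winner, losers) for candidates that share one stable key."""
    # Unknown origins sort last; sorted() is stable, so ties keep input order.
    ordered = sorted(candidates, key=lambda c: _RANK.get(c["origin"], len(PRECEDENCE)))
    return ordered[0], ordered[1:]
```

Returning the losers as well lets the caller keep their anchors and provenance as supporting evidence instead of discarding them.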
Duplicate Handling
The reconciler should merge candidates with the same stable key automatically. It should also look for possible duplicates using:
- alias overlap
- identical source anchors
- identical evidence fingerprints
- normalized label similarity within the same entity kind
- relationship fingerprints with the same endpoints and edge type
- declaration ids that match discovery aliases
Exact stable-key matches can be merged automatically. Alias-only or
similarity-only matches should become needs_review conflicts unless an
extractor has a source-specific rule that makes the match deterministic.
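The split between automatic merges and review conflicts might be expressed as a small classifier. The field names (`stable_key`, `kind`, `label`, `aliases`), the return values, and the label normalization are assumptions; the sketch covers only the alias and label signals from the list above.

```python
import re


def _normalize_label(label):
    # Compare labels without case, whitespace, or punctuation differences.
    return re.sub(r"[^a-z0-9]+", "", label.lower())


def classify_pair(a, b):
    """Return 'merge', 'needs_review', or 'distinct' for two candidates."""
    if a["stable_key"] == b["stable_key"]:
        # Exact stable-key match: safe to merge automatically.
        return "merge"
    shares_alias = bool(set(a.get("aliases", [])) & set(b.get("aliases", [])))
    same_label = (
        a["kind"] == b["kind"]
        and _normalize_label(a["label"]) == _normalize_label(b["label"])
    )
    # Alias-only or label-only matches are surfaced for review, never merged.
    return "needs_review" if shares_alias or same_label else "distinct"
```

An extractor with a source-specific deterministic rule would short-circuit this check before it reaches the review branch.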
Rescan And Tombstones
On a rescan, the scanner compares the previous accepted discovery snapshot with the newly produced snapshot for the same repo/profile.
- Same stable key: update in place.
- Same source anchor but changed attributes: update with changed evidence.
- Missing from same replacement scope: create a tombstone.
- Missing from a different scope: leave untouched.
- Reappears after tombstone: reactivate if the stable key and scope match.
- Reappears with a new key but same alias/source anchor: flag as possible duplicate resurrection.
Tombstones explain graph drift and prevent immediate re-creation loops. They should be retained long enough to compare several scan cycles and can later be compacted by repo, extractor, or entity kind.
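The rescan rules above can be read as a per-candidate decision. The sketch below is one possible encoding under assumed field names (`stable_key`, `scope`, `anchor`, `tombstoned`) and assumed action labels. Sending a tombstoned key that reappears in a different scope to review is this sketch's own choice, since the rules only cover the matching-scope case.

```python
def rescan_action(previous, current_by_key, scope_modes):
    """Decide what a rescan does with one previously recorded candidate.

    previous: {"stable_key", "scope", "anchor", "tombstoned"}
    current_by_key: stable key -> {"scope", "anchor"} from the new scan
    scope_modes: scope id -> mode for scopes present in the new scan
    """
    current = current_by_key.get(previous["stable_key"])
    if current is not None:
        if previous["tombstoned"]:
            same_scope = current["scope"] == previous["scope"]
            # Assumption: a scope mismatch on reappearance goes to review.
            return "reactivate" if same_scope else "needs_review"
        return "update"
    if previous["tombstoned"]:
        # Key still absent; check whether its anchor came back under a new key.
        reused = any(c["anchor"] == previous["anchor"] for c in current_by_key.values())
        return "possible_duplicate_resurrection" if reused else "keep_tombstone"
    if scope_modes.get(previous["scope"]) == "replacement":
        return "tombstone"
    return "leave"
```

Alias matching for the resurrection case is omitted to keep the sketch short; it would sit alongside the anchor comparison.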
Mapping To Fabric Graphs
Discovery candidates can project into the existing graph model when accepted:
- candidate service nodes map to ServiceDeclaration-like graph nodes
- candidate capabilities and interfaces map to provider surface nodes
- candidate dependencies map to dependency nodes and consumes edges
- candidate deployment/runtime entities map to graph explorer infrastructure nodes until declarations gain first-class runtime support
- candidate libraries map to library inventory records and graph explorer nodes
If a repo-owned declaration already exists for the same entity, discovery output should attach as supporting evidence instead of creating another node.
LLM Boundary
LLM extraction through llm-connect is optional and schema-gated. The scanner
should use deterministic preselection to build small evidence bundles, ask for
structured JSON, validate the JSON against the discovery schema, and record:
- extractor id and version
- prompt version
- provider and model
- usage metadata
- confidence and uncertainty
- rationale
Malformed, low-confidence, or conflicting LLM output becomes review material, not accepted graph data.
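A minimal sketch of the gate between raw model text and candidate projection, using only the standard library. It checks the top-level shape rather than validating against the full discovery schema, and the function name, the `llm_malformed_output` artifact kind, and the provenance dict layout are assumptions.

```python
import json

EXPECTED_KEYS = ("nodes", "edges", "attributes")


def parse_llm_response(text, provenance):
    """Return (payload, review_artifact); exactly one of the two is None.

    provenance carries whatever the scanner records for the call, such as
    extractor id, prompt version, provider, and model.
    """
    try:
        payload = json.loads(text)
    except json.JSONDecodeError as exc:
        return None, {
            "kind": "llm_malformed_output",  # hypothetical artifact kind
            "reason": f"invalid JSON: {exc.msg}",
            "provenance": provenance,
        }
    shape_ok = isinstance(payload, dict) and all(
        isinstance(payload.get(key), list) for key in EXPECTED_KEYS
    )
    if not shape_ok:
        return None, {
            "kind": "llm_malformed_output",
            "reason": "expected nodes, edges, and attributes lists",
            "provenance": provenance,
        }
    return payload, None
```

Keeping the provenance on the review artifact means a reviewer can see which prompt version and model produced the rejected output.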