# Repo Reality Scanner

The repo reality scanner discovers Fabric entities from repository evidence and
turns them into candidate graph facts. It is a discovery layer, not a new
authoring surface. Repo-owned declarations remain high-trust self-description
evidence, but financial Fabric ownership, tenant boundaries, and cross-boundary
utility relations must be resolved from accountability roots or review
decisions before they become accepted graph data.

## Contract

A scanner run emits a `FabricDiscoverySnapshot`. The snapshot is scoped to one
repository, one commit, and one scan profile. It contains:

- replacement scopes, which define the evidence sets that may be replaced on a
  rescan
- candidate nodes, edges, and attributes
- source anchors for every candidate
- extractor provenance for every candidate
- tombstones for candidates that vanished inside a replacement scope
- reconciliation policy metadata

The snapshot's JSON Schema is kept as a YAML document at
`schemas/discovery-snapshot.schema.yaml`.
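
The "for every candidate" guarantees in the list above can be checked by a consumer before it trusts a snapshot. The following is a minimal sketch, not the scanner's own validation: the field names `stable_key`, `source_anchors`, and `provenance` are assumptions made for illustration, and the schema file remains the authoritative definition.

```python
def missing_evidence(snapshot: dict) -> list[str]:
    """List candidates that lack source anchors or extractor provenance.

    Field names are hypothetical; consult the schema for the real ones.
    """
    problems = []
    for section in ("nodes", "edges", "attributes"):
        for candidate in snapshot.get(section, []):
            for required in ("source_anchors", "provenance"):
                # An absent key and an empty list are both treated as missing.
                if not candidate.get(required):
                    key = candidate.get("stable_key", "?")
                    problems.append(f"{section}:{key} missing {required}")
    return problems


snapshot = {
    "nodes": [
        {
            "stable_key": "discovery:demo:service:api",
            "source_anchors": [{"path": "fabric/service.yaml"}],
            "provenance": {"extractor": "fabric-declarations"},
        },
        {"stable_key": "discovery:demo:library:requests", "source_anchors": []},
    ],
    "edges": [],
    "attributes": [],
}
problems = missing_evidence(snapshot)
```

Here `problems` names the second node twice, once for each missing field.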

## Deterministic Scanner CLI

The first implementation slice adds an offline deterministic scan command:

```bash
railiance-fabric scan . \
  --repo-slug railiance-fabric \
  --commit "$(git rev-parse HEAD)" \
  --dry-run \
  --output discovery-snapshot.json
```

Use `--json` to print the full `FabricDiscoverySnapshot` to stdout. Without
`--json`, the command prints a concise summary of node, edge, attribute, and
replacement-scope counts. The scanner does not call registries, catalogs, or
LLMs in this mode; `--output` is the only write side effect.

The deterministic extractor framework currently covers:

- repository metadata from local git/path evidence
- README, INTENT, and SCOPE document presence and headings
- repo-owned Fabric declarations under `fabric/`
- Python `pyproject.toml` package metadata and dependencies
- Node `package.json` package metadata and dependencies
- common lockfiles such as `package-lock.json`, `poetry.lock`, and `uv.lock`
- Dockerfiles and Docker Compose services
- OpenAPI and AsyncAPI contract files
- Score workload files
- Kubernetes-style deployment manifests
- common service config files such as `application.yaml` and
  `appsettings.json`

Each extractor emits candidates through the same accumulator so stable-key
duplicates merge inside a scan before the snapshot is returned.
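
The shared accumulator can be pictured as a dictionary keyed by stable key. This is a minimal sketch under assumed names (`Candidate`, `CandidateAccumulator`, and the `anchors` field are hypothetical); the real accumulator may merge more fields than anchors.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """Hypothetical candidate shape; the real schema may differ."""

    stable_key: str
    kind: str
    anchors: list[str] = field(default_factory=list)


class CandidateAccumulator:
    """Collects candidates from all extractors, merging by stable key."""

    def __init__(self) -> None:
        self._by_key: dict[str, Candidate] = {}

    def add(self, candidate: Candidate) -> None:
        existing = self._by_key.get(candidate.stable_key)
        if existing is None:
            # Copy the anchor list so later merges never mutate the caller's object.
            self._by_key[candidate.stable_key] = Candidate(
                candidate.stable_key, candidate.kind, list(candidate.anchors)
            )
            return
        # Same stable key: keep one candidate and union the anchors in order.
        for anchor in candidate.anchors:
            if anchor not in existing.anchors:
                existing.anchors.append(anchor)

    def candidates(self) -> list[Candidate]:
        # Sorting by key keeps the output order independent of extractor order.
        return sorted(self._by_key.values(), key=lambda c: c.stable_key)


acc = CandidateAccumulator()
acc.add(Candidate("discovery:demo:library:requests", "library", ["pyproject.toml"]))
acc.add(Candidate("discovery:demo:library:requests", "library", ["uv.lock"]))
merged = acc.candidates()
```

Two extractors reporting the same library produce one candidate that carries both anchors.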

## LLM-Assisted Extraction

LLM extraction is optional and explicit:

```bash
railiance-fabric scan . \
  --repo-slug railiance-fabric \
  --llm \
  --llm-provider openai \
  --llm-model gpt-4.1-mini \
  --dry-run \
  --output discovery-with-llm.json
```

The implementation integrates through `llm-connect` with `create_adapter` and
`RunConfig`. Tests use a `MockLLMAdapter`-compatible boundary so CI stays
offline. If `llm-connect` is unavailable, the provider call fails, or the model
returns malformed JSON, the scanner records a `review_artifacts` entry and keeps
the discovery snapshot schema-valid.

The LLM never receives the whole repository. The scanner first builds a compact
evidence bundle from deterministic candidates, prioritizing durable local
evidence such as Fabric declarations, services, capabilities, interfaces,
libraries, deployments, and small README/INTENT/SCOPE signals. The prompt asks
for strict JSON:

```json
{"nodes": [], "edges": [], "attributes": []}
```

Projected LLM candidates are always `origin: llm` and
`review_state: needs_review`. Candidates below the configured confidence
threshold become `llm_low_confidence` review artifacts instead of graph
candidates. Unresolved edge endpoints or attribute targets also become review
artifacts. Accepted graph data still requires deterministic evidence,
repo-owned declarations, or a later human review/acceptance path.
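
The projection rules above can be sketched as a single pass over the parsed model output. The function name, the `confidence`, `source`, and `target` fields, the default threshold, and the `llm_unresolved_reference` artifact kind are assumptions for illustration; only `origin: llm`, `review_state: needs_review`, and `llm_low_confidence` come from this document.

```python
def project_llm_payload(payload, known_keys, threshold=0.6):
    """Split parsed LLM output into graph candidates and review artifacts."""
    candidates, artifacts = [], []
    resolvable = set(known_keys)

    for node in payload.get("nodes", []):
        if node["confidence"] < threshold:
            artifacts.append({"kind": "llm_low_confidence", "item": node})
            continue
        # Projected candidates are never auto-accepted.
        candidates.append({**node, "origin": "llm", "review_state": "needs_review"})
        resolvable.add(node["stable_key"])

    for edge in payload.get("edges", []):
        if edge["confidence"] < threshold:
            artifacts.append({"kind": "llm_low_confidence", "item": edge})
        elif edge["source"] not in resolvable or edge["target"] not in resolvable:
            # Hypothetical artifact kind; the document does not name one.
            artifacts.append({"kind": "llm_unresolved_reference", "item": edge})
        else:
            candidates.append({**edge, "origin": "llm", "review_state": "needs_review"})

    return candidates, artifacts


payload = {
    "nodes": [
        {"stable_key": "discovery:demo:service:api", "confidence": 0.9},
        {"stable_key": "discovery:demo:service:worker", "confidence": 0.3},
    ],
    "edges": [
        {"source": "discovery:demo:service:api",
         "target": "discovery:demo:library:requests",
         "type": "consumes", "confidence": 0.8},
        {"source": "discovery:demo:service:api",
         "target": "discovery:demo:service:worker",
         "type": "consumes", "confidence": 0.9},
    ],
    "attributes": [],
}
candidates, artifacts = project_llm_payload(
    payload, known_keys={"discovery:demo:library:requests"}
)
```

The low-confidence `worker` node is diverted to review, which in turn leaves the second edge without a resolvable target.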

## Reconciliation And Dry-Run Diffs

Scans can be reconciled against a previous discovery snapshot:

```bash
railiance-fabric scan . \
  --repo-slug railiance-fabric \
  --previous-snapshot previous-discovery.json \
  --dry-run \
  --output current-discovery.json
```

The reconciler writes `reconciliation.diff` with explicit stable-key sets:

- `added`
- `changed`
- `retired`
- `conflicted`

It deduplicates candidates by stable key, merges source anchors and provenance,
and applies source-aware precedence when duplicate candidates disagree. The
current precedence is:

1. `repo_declaration`
2. `deterministic`
3. `catalog`
4. `registry`
5. `llm`
6. `manual`

Possible duplicates found through matching aliases, normalized labels,
relationship endpoints, or attribute targets are not silently merged. They are
marked `status: conflicted`, moved to `review_state: needs_review`, and listed
under `reconciliation.conflicts`.

Missing previous candidates become tombstones only when their replacement scope
is present in the current scan and has `mode: replacement`. Missing candidates
from additive scopes, such as broad LLM evidence bundles, are left alone.
Existing tombstones are preserved so repeated scans can explain graph drift.
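
The tombstone condition reduces to two checks per missing candidate: the scope must appear in the current scan, and its mode must be `replacement`. A minimal sketch, assuming each previous candidate records a `scope` id and that the current scan exposes a mapping from scope id to mode (both shapes are hypothetical):

```python
def new_tombstones(previous, current_keys, current_scopes):
    """Return stable keys of previous candidates that should be retired.

    previous: list of {"stable_key": ..., "scope": ...} dicts
    current_keys: set of stable keys seen in the current scan
    current_scopes: dict of scope id -> "replacement" or "additive"
    """
    retired = []
    for candidate in previous:
        if candidate["stable_key"] in current_keys:
            continue  # still observed, nothing to retire
        # A scope absent from this scan yields None and is left alone.
        if current_scopes.get(candidate["scope"]) == "replacement":
            retired.append(candidate["stable_key"])
    return retired


previous = [
    {"stable_key": "k:removed-dep", "scope": "scope:demo:python-package"},
    {"stable_key": "k:llm-guess", "scope": "scope:demo:readme-summary"},
    {"stable_key": "k:catalog-fact", "scope": "scope:demo:local-registry"},
]
current_scopes = {
    "scope:demo:python-package": "replacement",
    "scope:demo:readme-summary": "additive",
    # The local-registry scope did not run in this scan.
}
retired = new_tombstones(previous, current_keys=set(), current_scopes=current_scopes)
```

Only the dependency from the replacement scope is retired; the additive and unscanned scopes keep their earlier candidates.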

## Registry Review And Acceptance

Discovery snapshots can be stored in the Fabric registry for review:

```bash
railiance-fabric scan . \
  --repo-slug railiance-fabric \
  --previous-snapshot previous-discovery.json \
  --output discovery.json

railiance-fabric registry ingest-discovery discovery.json \
  --repo-slug railiance-fabric
```

The registry stores discovery snapshots separately from accepted graph
snapshots, keyed by repo, commit, and scan profile. It exposes latest/list/diff
API routes so a dry run can be reviewed without changing the accepted graph.

Accepted discovery can be projected into a normal graph snapshot:

```bash
railiance-fabric registry accept-discovery railiance-fabric 12 \
  --accepted-key discovery:railiance-fabric:service-declaration:example
```

By default, the accept path only projects candidates already marked
`review_state: accepted`. Passing `--accepted-key` explicitly includes selected
candidate stable keys. Existing accepted graph nodes win over discovery nodes
with the same graph id, so repo-owned declarations are preserved. Projected
nodes carry discovery stable key, origin, review state, confidence, provenance,
and source anchors in graph attributes; the graph explorer payload exposes
those fields for review. For financial Fabric fields, accepted discovery still
needs an accountability root, a baseline inheritance rule, or an explicit
review decision.
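
The selection rules of the accept path can be sketched as a filter. The function and the `graph_id` field are hypothetical names for illustration; the registry's actual projection also copies provenance and anchors into graph attributes, which this sketch omits.

```python
def select_for_projection(candidates, accepted_keys=(), existing_graph_ids=()):
    """Pick the discovery candidates that would be projected into the graph."""
    explicit = set(accepted_keys)
    existing = set(existing_graph_ids)
    selected = []
    for candidate in candidates:
        chosen = (
            candidate["review_state"] == "accepted"
            or candidate["stable_key"] in explicit
        )
        if not chosen:
            continue
        # An already accepted graph node with the same id wins over discovery.
        if candidate["graph_id"] in existing:
            continue
        selected.append(candidate)
    return selected


candidates = [
    {"stable_key": "k:a", "graph_id": "svc-a", "review_state": "accepted"},
    {"stable_key": "k:b", "graph_id": "svc-b", "review_state": "needs_review"},
    {"stable_key": "k:c", "graph_id": "svc-c", "review_state": "needs_review"},
    {"stable_key": "k:d", "graph_id": "svc-d", "review_state": "accepted"},
]
selected = select_for_projection(
    candidates, accepted_keys=["k:b"], existing_graph_ids=["svc-d"]
)
```

`k:a` is projected by default, `k:b` through the explicit key, `k:c` is skipped, and `k:d` yields to the existing graph node.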

## Connector Follow-Up

Connector follow-up is explicit and separated from repo-local extraction:

```bash
railiance-fabric scan . \
  --repo-slug railiance-fabric \
  --connector local-fabric-registry \
  --connector-manifest registry/local-repos.yaml \
  --dry-run
```

The connector interface has slots for:

- package registries
- container registries
- API catalogs
- service catalogs
- deployment inventories
- existing Fabric registry data

The first implementation is `local-fabric-registry`, an offline-safe connector
that reads a local onboarding manifest such as `registry/local-repos.yaml`. It
adds a `FabricRegistryEntry` candidate, a `cataloged_as` edge from the
repository node, and registry-sourced attributes such as domain, remote URL,
default branch, State Hub repo id, and declaration paths.

Connector evidence uses its own replacement scope with source kind
`fabric_registry`, so rescans can replace catalog facts without retiring
repo-local evidence. Connector run metadata is recorded under `connector_runs`
with status, source, message, and candidate counts.

Connector-derived facts should be treated this way:

- accepted: only when the connector reads explicit repo-owned evidence,
  accountability-root evidence, or a catalog already governed as authoritative
  for that field
- candidate: stable local registry facts such as onboarding manifest entries,
  declared remote URLs, State Hub ids, and declaration paths
- review-only: missing catalogs, rate limits, connector failures, ambiguous
  matches, or facts from catalogs with unclear ownership

Failures do not corrupt the scan. Missing catalogs become
`connector_unavailable` review artifacts, malformed catalogs become
`connector_failed` artifacts, and future remote connectors should use
`connector_rate_limited` when backoff is required.
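
One way to keep connector failures from corrupting a scan is to return review artifacts instead of raising. The sketch below is an assumption-laden illustration: the function name and artifact fields are hypothetical, and it parses JSON only so that it stays within the standard library, whereas the real onboarding manifest is YAML.

```python
import json
from pathlib import Path


def read_manifest_entries(manifest_path):
    """Return (entries, review_artifacts) without raising on bad input."""
    path = Path(manifest_path)
    if not path.is_file():
        artifact = {"kind": "connector_unavailable", "source": str(path),
                    "message": "manifest not found"}
        return [], [artifact]
    try:
        entries = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError) as exc:
        # json.JSONDecodeError and UnicodeDecodeError are both ValueError subclasses.
        artifact = {"kind": "connector_failed", "source": str(path),
                    "message": str(exc)}
        return [], [artifact]
    if not isinstance(entries, list):
        artifact = {"kind": "connector_failed", "source": str(path),
                    "message": "expected a list of repo entries"}
        return [], [artifact]
    return entries, []
```

The caller can append the returned artifacts to the snapshot's review artifacts and continue with repo-local candidates either way.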

## Multi-Repo Orchestration

Known local repos can be scanned from the same onboarding manifest used by
`registry sync-manifest`:

```bash
railiance-fabric registry scan-manifest registry/local-repos.yaml \
  --dry-run \
  --output-dir .fabric-discovery
```

The command isolates each repo. A missing path, invalid previous snapshot, or
registry write failure is reported for that repo without aborting the rest of
the run. The summary includes repo counts for scanned, changed, retired,
conflicted, LLM skipped, LLM failed, ingested, accepted, and errors so it can be
copied into State Hub progress notes or future automation output.
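
Per-repo isolation amounts to catching failures at the repo boundary and counting them. A minimal sketch, assuming a hypothetical `scan_one` callable that returns a snapshot or raises; only two of the summary counters are shown.

```python
def scan_all(repos, scan_one):
    """Scan every repo, recording failures without aborting the run.

    repos: dict of repo slug -> local path
    scan_one: callable that takes a path and returns a snapshot dict
    """
    results = {}
    summary = {"scanned": 0, "errors": 0}
    for slug, path in repos.items():
        try:
            snapshot = scan_one(path)
        except Exception as exc:  # isolate any failure to this repo
            results[slug] = {"status": "error", "message": str(exc)}
            summary["errors"] += 1
            continue
        results[slug] = {"status": "ok", "snapshot": snapshot}
        summary["scanned"] += 1
    return results, summary


def fake_scan(path):
    # Stand-in scanner used only for this example.
    if path == "/missing":
        raise FileNotFoundError(path)
    return {"nodes": [], "edges": [], "attributes": []}


results, summary = scan_all(
    {"repo-a": "/repos/a", "repo-b": "/missing", "repo-c": "/repos/c"}, fake_scan
)
```

The missing path is reported for `repo-b` while the other two repos still complete.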

Useful controls:

- `--repo-slug <slug>` can be repeated to scan an allowlist.
- `--profile <name>` tags the scan profile and output filename.
- `--previous-dir <dir>` reconciles each repo against
  `<slug>-<profile>.discovery.json` from an earlier run.
- `--llm` enables LLM-assisted extraction; `--deterministic-only` forces the
  offline rule path.
- `--llm-max-runs <n>` caps how many repos may attempt LLM extraction in one
  orchestration run, while `--llm-max-tokens` remains the per-repo request cap.
- `--connector local-fabric-registry` attaches manifest-derived registry facts
  to every repo scan.
- `--ingest` stores discovery snapshots in the registry; `--accept` then
  projects accepted candidates into graph snapshots. `--dry-run` suppresses
  registry writes even when those flags are present.

Example review cycle:

```bash
railiance-fabric registry scan-manifest registry/local-repos.yaml \
  --repo-slug railiance-fabric \
  --previous-dir .fabric-discovery \
  --output-dir .fabric-discovery \
  --connector local-fabric-registry \
  --dry-run
```

After review, rerun with `--ingest` to store the snapshots. Add `--accept` only
when candidates marked `review_state: accepted` should be projected into the
registry graph.

For repeated operational loops, including default cache paths, registry-backed
previous snapshots, run reports, exit codes, and rescan health views, see
`docs/operational-rescan-loops.md`.

## Scan Profiles And Review Workflow

The initial profile is `deterministic`, which means repo-local extraction plus
any explicitly enabled offline connectors. Additional profiles should be named
for the evidence policy they represent, for example `deterministic-llm-draft`
or `catalog-followup`. Keep profile names stable because per-repo previous
snapshots use `<slug>-<profile>.discovery.json`.

Recommended workflow:

1. Run `scan` or `registry scan-manifest` with `--dry-run`.
2. Reconcile with `--previous-snapshot` or `--previous-dir` when a prior
   snapshot exists.
3. Review candidates with `review_state: needs_review`, `status: conflicted`,
   tombstones, and review artifacts before accepting anything.
4. Store reviewed output with `registry ingest-discovery`.
5. Use `registry accept-discovery` or `registry scan-manifest --ingest --accept`
   only for candidates whose review state is acceptable for projection.

## Failure Modes

Failures are captured close to the evidence source:

- Missing repo paths, invalid manifest entries, unreadable previous snapshots,
  and registry request failures mark that repo as `status: error` in
  `scan-manifest` without stopping other repos.
- Connector failures become review artifacts such as `connector_unavailable` or
  `connector_failed`.
- LLM provider failures and malformed model output become `llm_execution_error`
  or `llm_output_invalid` review artifacts.
- Low-confidence LLM candidates become `llm_low_confidence` artifacts instead
  of graph candidates.
- Possible duplicates are marked as conflicts and left for review instead of
  being silently merged.

## Rollout Dry Run

The first small local rollout ran on 2026-05-19:

```bash
railiance-fabric registry scan-manifest registry/local-repos.yaml \
  --repo-slug repo-scoping \
  --repo-slug llm-connect \
  --repo-slug railiance-fabric \
  --dry-run \
  --connector local-fabric-registry
```

Result:

- `repo-scoping`: 18 nodes, 17 edges, 13 attributes
- `llm-connect`: 5 nodes, 4 edges, 13 attributes
- `railiance-fabric`: 55 nodes, 63 edges, 13 attributes
- summary: 3 scanned, 0 changed, 0 retired, 0 conflicted, 3 LLM skipped,
  0 LLM failed, 0 accepted, 0 errors

Follow-up backlog from this first pass:

- Add a standard discovery snapshot directory, likely `.fabric-discovery/`, so
  repeated dry-runs can reconcile by default.
- Add a previous-from-registry option so manifest scans can diff against the
  latest stored discovery snapshot without exporting JSON first.
- Expand runtime/deployment extraction beyond local manifests to cover live
  server and deployment inventory connectors.
- Add review UI affordances for conflicts, tombstones, and bulk acceptance once
  enough repos have baseline snapshots.
- Define privacy and budget defaults before enabling non-mock LLM providers in
  multi-repo scans.

## Identity

Identity is the main safety boundary. The scanner must not append guesses on
every run. It needs to produce stable keys that are repeatable for the same
observed entity.

Candidate node keys use this shape:

```text
discovery:{repo_slug}:{entity_kind}:{normalized_name}[:source_fingerprint]
```

Use the optional source fingerprint when a name is too generic or when multiple
entities of the same kind can share a display name. Examples include HTTP
routes, generated clients, deployment manifests, and catalog records.

Candidate edge keys use a relationship fingerprint over:

- source stable key
- edge type
- target stable key
- optional evidence scope

Candidate attribute keys use the entity stable key plus the normalized
attribute name and, where needed, a source fingerprint.

Stable-key parts are lowercased and normalized to ASCII-like identity segments.
The helper functions in `railiance_fabric.discovery` define the initial rules.
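
The key shapes described in this section can be sketched as follows. These are illustrative functions, not the helpers in `railiance_fabric.discovery`: the names, the exact normalization rule, the separator, and the 16-character hash length are all assumptions.

```python
import hashlib
import re
import unicodedata


def normalize_segment(value: str) -> str:
    """Lowercase, fold to ASCII, and collapse other characters into hyphens."""
    folded = unicodedata.normalize("NFKD", value)
    ascii_text = folded.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def node_key(repo_slug, entity_kind, name, source_fingerprint=None):
    """Build discovery:{repo_slug}:{entity_kind}:{normalized_name}[:fingerprint]."""
    parts = [
        "discovery",
        normalize_segment(repo_slug),
        normalize_segment(entity_kind),
        normalize_segment(name),
    ]
    if source_fingerprint:
        parts.append(source_fingerprint)
    return ":".join(parts)


def edge_fingerprint(source_key, edge_type, target_key, evidence_scope=""):
    """Hash the ordered relationship parts into a short, repeatable fingerprint."""
    # The unit separator cannot appear in normalized keys, so parts never blur.
    payload = "\x1f".join([source_key, edge_type, target_key, evidence_scope])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


service_key = node_key("Railiance-Fabric", "Service", "Caf\u00e9 API")
route_key = node_key("demo", "route", "GET /users", source_fingerprint="ab12")
forward = edge_fingerprint(service_key, "consumes", route_key)
reverse = edge_fingerprint(route_key, "consumes", service_key)
```

Because every input is normalized before joining, display-name differences in case, accents, or punctuation map to the same key, while the edge fingerprint stays direction-sensitive.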

## Source Anchors

Every candidate must carry one or more source anchors. A source anchor identifies
why the scanner believes the fact exists. Anchors can point to files, package
manifests, lockfiles, API contracts, deployment manifests, service catalogs,
registries, LLM evidence bundles, or manual review notes.

Source anchors include a fingerprint. The fingerprint should cover stable
location fields such as path, URL, ref, line range, or JSON pointer. Snippets are
useful for review but should not be the only identity anchor because formatting
noise can churn snippets.
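
A fingerprint that ignores snippets can be sketched by hashing only the location fields. The field names below mirror the list in the previous paragraph but are assumptions about the anchor shape, as is the choice of SHA-256 over canonical JSON.

```python
import hashlib
import json

# Hypothetical field names for the stable location parts of an anchor.
LOCATION_FIELDS = ("path", "url", "ref", "line_range", "json_pointer")


def anchor_fingerprint(anchor: dict) -> str:
    """Hash stable location fields only; snippets and notes are ignored."""
    location = {
        name: anchor[name]
        for name in LOCATION_FIELDS
        if anchor.get(name) is not None
    }
    # Sorted keys and fixed separators give one canonical serialization.
    canonical = json.dumps(location, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


original = {"path": "pyproject.toml", "line_range": [10, 14],
            "snippet": "requests = '^2.31'"}
reformatted = {"path": "pyproject.toml", "line_range": [10, 14],
               "snippet": 'requests = "^2.31"'}
moved = {"path": "setup.cfg", "line_range": [10, 14],
         "snippet": "requests = '^2.31'"}
```

`original` and `reformatted` share a fingerprint because only the snippet changed, while `moved` gets a different one.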

## Replacement Scopes

A replacement scope says which extractor owns which set of candidates. Rescans
may retire missing candidates only inside the same scope.

Examples:

- `scope:repo-scoping:python-package:package_manifest:<hash>`
- `scope:state-hub:fabric-declarations:declaration`
- `scope:llm-connect:readme-summary:file:<hash>`
- `scope:railiance-fabric:local-registry:fabric_registry`

Scopes have a mode:

- `replacement`: candidates missing from the next run in the same scope become
  tombstones.
- `additive`: candidates are added or updated, but absence does not retire old
  candidates.

LLM extractors should usually use replacement mode only for tightly bounded
evidence bundles. Broad repo summaries are safer as additive or review-only
until the extraction prompts are proven stable.

## Merge Precedence

When multiple sources describe the same entity, reconciliation uses this
precedence:

1. `repo_declaration`
2. `deterministic`
3. `catalog`
4. `registry`
5. `llm`
6. `manual`

`manual` ranks last when sources are merged automatically, but manual review can
still override local candidate state. It should not silently rewrite repo-owned
declarations. If accepted discoveries should become durable repo-local
evidence, generate a repo-owned declaration patch for human review.
If they affect financial ownership, fabric containment, tenancy, or utility
value boundaries, generate a baseline or accountability-root review item
instead.
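
The precedence order listed in this section can be applied with a rank lookup. This is a minimal sketch: the `origin` and `anchors` field names and the `resolve` function are assumptions, and the real reconciler may record the losing values as conflicts rather than dropping them.

```python
PRECEDENCE = (
    "repo_declaration",
    "deterministic",
    "catalog",
    "registry",
    "llm",
    "manual",
)
RANK = {origin: position for position, origin in enumerate(PRECEDENCE)}


def resolve(duplicates):
    """Merge candidates that share a stable key, keeping the top-ranked fields."""
    winner = min(duplicates, key=lambda c: RANK[c["origin"]])
    anchors = []
    for candidate in duplicates:
        # Evidence from every source is kept, even when its fields lose.
        for anchor in candidate["anchors"]:
            if anchor not in anchors:
                anchors.append(anchor)
    return {**winner, "anchors": anchors}


resolved = resolve([
    {"stable_key": "k:api", "origin": "llm", "kind": "library",
     "anchors": ["README.md"]},
    {"stable_key": "k:api", "origin": "deterministic", "kind": "service",
     "anchors": ["Dockerfile"]},
])
```

The deterministic candidate supplies the disputed `kind`, and the merged result keeps both source anchors.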

## Duplicate Handling

The reconciler should merge candidates with the same stable key automatically.
It should also look for possible duplicates using:

- alias overlap
- identical source anchors
- identical evidence fingerprints
- normalized label similarity within the same entity kind
- relationship fingerprints with the same endpoints and edge type
- declaration ids that match discovery aliases

Only exact stable-key matches are merged without review. Alias-only or
similarity-only matches should become `needs_review` conflicts unless an
extractor has a source-specific rule that makes the match deterministic.
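
The split between automatic merges and review conflicts can be sketched as a pairwise classifier. Only two of the listed signals are shown (alias overlap and normalized labels), label similarity is simplified to equality after normalization, and all names are hypothetical.

```python
import re


def normalize_label(label: str) -> str:
    """Reduce a display label to lowercase letters and digits."""
    return re.sub(r"[^a-z0-9]+", "", label.lower())


def classify_pair(first: dict, second: dict) -> str:
    """Return "merge", "needs_review", or "distinct" for two candidates."""
    if first["stable_key"] == second["stable_key"]:
        return "merge"
    alias_overlap = bool(set(first["aliases"]) & set(second["aliases"]))
    same_label = (
        first["kind"] == second["kind"]
        and normalize_label(first["label"]) == normalize_label(second["label"])
    )
    # Weak signals never merge on their own; they only raise a conflict.
    if alias_overlap or same_label:
        return "needs_review"
    return "distinct"


declared = {"stable_key": "k:billing-api", "kind": "service",
            "label": "Billing API", "aliases": ["billing"]}
guessed = {"stable_key": "k:billing", "kind": "service",
           "label": "billing-api", "aliases": []}
unrelated = {"stable_key": "k:search", "kind": "service",
             "label": "Search", "aliases": ["lookup"]}
```

`declared` and `guessed` differ in stable key but normalize to the same label, so the pair is routed to review instead of being merged.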

## Rescan And Tombstones

On a rescan, the scanner compares the previous accepted discovery snapshot with
the newly produced snapshot for the same repo/profile.

- Same stable key: update in place.
- Same source anchor but changed attributes: update with changed evidence.
- Missing from same replacement scope: create a tombstone.
- Missing from a different scope: leave untouched.
- Reappears after tombstone: reactivate if the stable key and scope match.
- Reappears with a new key but same alias/source anchor: flag as possible
  duplicate resurrection.

Tombstones explain graph drift and prevent immediate re-creation loops. They
should be retained long enough to compare several scan cycles and can later be
compacted by repo, extractor, or entity kind.
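
The rescan rules above form a small decision table. The sketch below assumes hypothetical `scope`, `anchor`, and `tombstoned` fields, and it adds two outcomes the list does not spell out (`keep_tombstone` and `needs_review` for a retired key that returns under a different scope).

```python
def rescan_action(prev, current, replaced_scopes):
    """Decide what happens to one previous candidate on a rescan.

    prev: previous candidate dict
    current: dict of stable key -> candidate from the new scan
    replaced_scopes: set of replacement-mode scope ids present in the new scan
    """
    match = current.get(prev["stable_key"])
    if match is not None:
        if not prev.get("tombstoned"):
            return "update"
        # Reactivation requires the same scope as well as the same key.
        if match["scope"] == prev["scope"]:
            return "reactivate"
        # Assumption: a scope change on a retired key is left to a reviewer.
        return "needs_review"
    if prev.get("tombstoned"):
        same_anchor = any(c["anchor"] == prev["anchor"] for c in current.values())
        return "possible_duplicate_resurrection" if same_anchor else "keep_tombstone"
    if prev["scope"] in replaced_scopes:
        return "tombstone"
    return "untouched"


current = {
    "k:api": {"scope": "s1", "anchor": "a1"},
    "k:api-v2": {"scope": "s1", "anchor": "a9"},
}
replaced = {"s1"}
actions = [
    rescan_action({"stable_key": "k:api", "scope": "s1", "anchor": "a1"},
                  current, replaced),
    rescan_action({"stable_key": "k:old", "scope": "s1", "anchor": "a2"},
                  current, replaced),
    rescan_action({"stable_key": "k:other", "scope": "s2", "anchor": "a3"},
                  current, replaced),
    rescan_action({"stable_key": "k:api", "scope": "s1", "anchor": "a1",
                   "tombstoned": True}, current, replaced),
    rescan_action({"stable_key": "k:api-v1", "scope": "s1", "anchor": "a9",
                   "tombstoned": True}, current, replaced),
]
```

The five calls exercise the update, tombstone, untouched, reactivate, and duplicate-resurrection branches in that order.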

## Mapping To Fabric Graphs

Discovery candidates can project into the existing graph model when accepted:

- candidate service nodes map to `ServiceDeclaration`-like graph nodes
- candidate capabilities and interfaces map to provider surface nodes
- candidate dependencies map to dependency nodes and `consumes` edges
- candidate deployment/runtime entities map to graph explorer infrastructure
  nodes until declarations gain first-class runtime support
- candidate libraries map to library inventory records and graph explorer nodes

If a repo-owned declaration already exists for the same entity, discovery output
should attach as supporting evidence instead of creating another node.

## LLM Boundary

LLM extraction through `llm-connect` is optional and schema-gated. The scanner
should use deterministic preselection to build small evidence bundles, ask for
structured JSON, validate the JSON against the discovery schema, and record:

- extractor id and version
- prompt version
- provider and model
- usage metadata
- confidence and uncertainty
- rationale

Malformed, low-confidence, or conflicting LLM output becomes review material,
not accepted graph data.
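
A gate of this kind can be sketched as a parse step that returns either a payload or a review artifact. The function name and provenance keys are assumptions, and the check below only verifies the top-level shape; it is a stand-in for full validation against the discovery schema, not a replacement for it.

```python
import json

REQUIRED_LISTS = ("nodes", "edges", "attributes")


def parse_llm_response(text, provenance):
    """Return (payload, None) on success or (None, review_artifact) on failure."""
    try:
        payload = json.loads(text)
    except json.JSONDecodeError as exc:
        return None, {"kind": "llm_output_invalid", "message": str(exc),
                      "provenance": provenance}
    well_formed = isinstance(payload, dict) and all(
        isinstance(payload.get(name), list) for name in REQUIRED_LISTS
    )
    if not well_formed:
        return None, {"kind": "llm_output_invalid",
                      "message": "expected nodes, edges, and attributes lists",
                      "provenance": provenance}
    return payload, None


# Hypothetical provenance record covering the fields listed above.
provenance = {
    "extractor_id": "llm-evidence-bundle",
    "extractor_version": "0.1",
    "prompt_version": "v1",
    "provider": "openai",
    "model": "gpt-4.1-mini",
}
ok_payload, ok_artifact = parse_llm_response(
    '{"nodes": [], "edges": [], "attributes": []}', provenance
)
bad_payload, bad_artifact = parse_llm_response("Sure! Here is the JSON:", provenance)
partial_payload, partial_artifact = parse_llm_response('{"nodes": []}', provenance)
```

Attaching the provenance record to each artifact lets a reviewer see which prompt and model produced the rejected output.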