# Repo Reality Scanner

The repo reality scanner discovers Fabric entities from repository evidence and
turns them into candidate graph facts. It is a discovery layer, not a new
authoring surface. Repo-owned declarations remain the highest-trust source for
accepted Fabric graph data.

## Contract

A scanner run emits a `FabricDiscoverySnapshot`. The snapshot is scoped to one
repository, one commit, and one scan profile. It contains:

- replacement scopes, which define the evidence sets that may be replaced on a
  rescan
- candidate nodes, edges, and attributes
- source anchors for every candidate
- extractor provenance for every candidate
- tombstones for candidates that vanished inside a replacement scope
- reconciliation policy metadata

The JSON schema lives at `schemas/discovery-snapshot.schema.yaml`.

## Deterministic Scanner CLI

The first implementation slice adds an offline deterministic scan command:

```bash
# Offline scan of the current directory; --output is the only file written.
railiance-fabric scan . \
  --repo-slug railiance-fabric \
  --commit "$(git rev-parse HEAD)" \
  --dry-run \
  --output discovery-snapshot.json
```

Use `--json` to print the full `FabricDiscoverySnapshot` to stdout. Without
`--json`, the command prints a concise summary of node, edge, attribute, and
replacement-scope counts. The scanner does not call registries, catalogs, or
LLMs in this mode; `--output` is the only write side effect.

The deterministic extractor framework currently covers:

- repository metadata from local git/path evidence
- README, INTENT, and SCOPE document presence and headings
- repo-owned Fabric declarations under `fabric/`
- Python `pyproject.toml` package metadata and dependencies
- Node `package.json` package metadata and dependencies
- common lockfiles such as `package-lock.json`, `poetry.lock`, and `uv.lock`
- Dockerfiles and Docker Compose services
- OpenAPI and AsyncAPI contract files
- Score workload files
- Kubernetes-style deployment manifests
- common service config files such as `application.yaml` and
  `appsettings.json`

Each extractor emits candidates through the same accumulator, so stable-key
duplicates merge inside a scan before the snapshot is returned.

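The shared accumulator can be pictured with a minimal sketch. The `Candidate` and `Accumulator` names and their fields are assumptions made for illustration, not the scanner's actual classes; the point is only that two extractors reporting the same stable key end up as one candidate with combined anchors and provenance.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """Hypothetical candidate shape; the real schema carries more fields."""
    stable_key: str
    anchors: set[str] = field(default_factory=set)
    extractors: set[str] = field(default_factory=set)


class Accumulator:
    """Collects candidates and merges those that share a stable key."""

    def __init__(self) -> None:
        self._by_key: dict[str, Candidate] = {}

    def add(self, candidate: Candidate) -> None:
        existing = self._by_key.get(candidate.stable_key)
        if existing is None:
            self._by_key[candidate.stable_key] = candidate
            return
        # Same stable key: keep one candidate, union anchors and provenance.
        existing.anchors |= candidate.anchors
        existing.extractors |= candidate.extractors

    def candidates(self) -> list[Candidate]:
        # Sort by key so repeated scans produce the same ordering.
        return [self._by_key[key] for key in sorted(self._by_key)]


acc = Accumulator()
key = "discovery:railiance-fabric:library:requests"
acc.add(Candidate(key, {"pyproject.toml"}, {"pyproject"}))
acc.add(Candidate(key, {"uv.lock"}, {"lockfile"}))
merged = acc.candidates()
```

Here `merged` holds a single candidate whose anchors cover both `pyproject.toml` and `uv.lock`.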
## LLM-Assisted Extraction

LLM extraction is optional and explicit:

```bash
# Same scan with optional LLM extraction enabled.
railiance-fabric scan . \
  --repo-slug railiance-fabric \
  --llm \
  --llm-provider openai \
  --llm-model gpt-4.1-mini \
  --dry-run \
  --output discovery-with-llm.json
```

The implementation integrates through `llm-connect` with `create_adapter` and
`RunConfig`. Tests use a `MockLLMAdapter`-compatible boundary so CI stays
offline. If `llm-connect` is unavailable, the provider call fails, or the model
returns malformed JSON, the scanner records a `review_artifacts` entry and keeps
the discovery snapshot schema-valid.

The LLM never receives the whole repository. The scanner first builds a compact
evidence bundle from deterministic candidates, prioritizing repo-owned Fabric
declarations, services, capabilities, interfaces, libraries, deployments, and
small README/INTENT/SCOPE signals. The prompt asks for strict JSON:

```json
{"nodes": [], "edges": [], "attributes": []}
```

Projected LLM candidates are always `origin: llm` and
`review_state: needs_review`. Candidates below the configured confidence
threshold become `llm_low_confidence` review artifacts instead of graph
candidates. Unresolved edge endpoints or attribute targets also become review
artifacts. Accepted graph data still requires deterministic evidence,
repo-owned declarations, or a later human review/acceptance path.

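The projection rules above can be sketched roughly as follows. Everything except `origin`, `review_state`, and `llm_low_confidence` is an assumption: the function name, the `confidence`, `source`, and `target` field names, the `0.6` default threshold, and the `llm_unresolved_endpoint` artifact kind are hypothetical. Attributes are omitted for brevity.

```python
def project_llm_output(payload, known_keys, threshold=0.6):
    """Split raw LLM output into graph candidates and review artifacts."""
    candidates, review_artifacts = [], []

    for node in payload.get("nodes", []):
        if (node.get("confidence") or 0.0) < threshold:
            review_artifacts.append({"kind": "llm_low_confidence", "payload": node})
            continue
        candidates.append({**node, "origin": "llm", "review_state": "needs_review"})

    # Edge endpoints may resolve to deterministic keys or to projected LLM nodes.
    resolvable = set(known_keys) | {c["stable_key"] for c in candidates if "stable_key" in c}

    for edge in payload.get("edges", []):
        if (edge.get("confidence") or 0.0) < threshold:
            review_artifacts.append({"kind": "llm_low_confidence", "payload": edge})
        elif edge.get("source") not in resolvable or edge.get("target") not in resolvable:
            # Hypothetical artifact kind for unresolved endpoints.
            review_artifacts.append({"kind": "llm_unresolved_endpoint", "payload": edge})
        else:
            candidates.append({**edge, "origin": "llm", "review_state": "needs_review"})

    return candidates, review_artifacts


payload = {
    "nodes": [
        {"stable_key": "discovery:demo:service:billing", "confidence": 0.9},
        {"stable_key": "discovery:demo:service:maybe", "confidence": 0.2},
    ],
    "edges": [
        {"source": "discovery:demo:service:billing",
         "target": "discovery:demo:service:ledger",
         "type": "consumes", "confidence": 0.8},
        {"source": "discovery:demo:service:billing",
         "target": "discovery:demo:service:unknown",
         "type": "consumes", "confidence": 0.8},
    ],
}
candidates, review_artifacts = project_llm_output(
    payload, known_keys={"discovery:demo:service:ledger"}
)
```

In this sketch nothing produced by the model is ever marked accepted; the best outcome for an LLM candidate is `needs_review`.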
## Reconciliation And Dry-Run Diffs

Scans can be reconciled against a previous discovery snapshot:

```bash
# Reconcile the new scan against a previously written snapshot.
railiance-fabric scan . \
  --repo-slug railiance-fabric \
  --previous-snapshot previous-discovery.json \
  --dry-run \
  --output current-discovery.json
```

The reconciler writes `reconciliation.diff` with explicit stable-key sets:

- `added`
- `changed`
- `retired`
- `conflicted`

It deduplicates candidates by stable key, merges source anchors and provenance,
and applies source-aware precedence when duplicate candidates disagree. The
current precedence is:

1. `repo_declaration`
2. `deterministic`
3. `catalog`
4. `registry`
5. `llm`
6. `manual`

Possible duplicates found through matching aliases, normalized labels,
relationship endpoints, or attribute targets are not silently merged. They are
marked `status: conflicted`, moved to `review_state: needs_review`, and listed
under `reconciliation.conflicts`.

Missing previous candidates become tombstones only when their replacement scope
is present in the current scan and has `mode: replacement`. Missing candidates
from additive scopes, such as broad LLM evidence bundles, are left alone.
Existing tombstones are preserved so repeated scans can explain graph drift.

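As a rough illustration of how the `added`, `changed`, and `retired` sets could be derived, the sketch below compares two snapshots keyed by stable key. The dictionary layout, the `fingerprint` field, and the `scope_modes` argument are assumptions for the example; the `conflicted` set is left out because it depends on duplicate detection.

```python
def diff_snapshots(previous, current, scope_modes):
    """Compute stable-key diff sets between two scans.

    previous/current map stable_key -> {"scope": str, "fingerprint": str}.
    scope_modes maps each scope present in the current scan to its mode.
    """
    added = sorted(current.keys() - previous.keys())
    changed = sorted(
        key for key in current.keys() & previous.keys()
        if current[key]["fingerprint"] != previous[key]["fingerprint"]
    )
    retired = []
    for key in sorted(previous.keys() - current.keys()):
        scope = previous[key]["scope"]
        # Tombstone only when the scope was rescanned in replacement mode.
        if scope_modes.get(scope) == "replacement":
            retired.append(key)
    return {"added": added, "changed": changed, "retired": retired}


previous = {
    "a": {"scope": "s1", "fingerprint": "1"},
    "b": {"scope": "s1", "fingerprint": "1"},
    "c": {"scope": "s2", "fingerprint": "1"},
    "d": {"scope": "s3", "fingerprint": "1"},
}
current = {
    "a": {"scope": "s1", "fingerprint": "2"},
    "e": {"scope": "s1", "fingerprint": "1"},
}
# s3 was not part of the current scan, so its candidates are left alone.
diff = diff_snapshots(previous, current, {"s1": "replacement", "s2": "additive"})
```

Only `b` is retired here: `c` belongs to an additive scope and `d` belongs to a scope the current scan did not cover.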
## Identity

Identity is the main safety boundary. The scanner must not append guesses on
every run. It needs to produce stable keys that are repeatable for the same
observed entity.

Candidate node keys use this shape:

```text
discovery:{repo_slug}:{entity_kind}:{normalized_name}[:source_fingerprint]
```

Use the optional source fingerprint when a name is too generic or when multiple
entities of the same kind can share a display name. Examples include HTTP
routes, generated clients, deployment manifests, and catalog records.

Candidate edge keys use a relationship fingerprint over:

- source stable key
- edge type
- target stable key
- optional evidence scope

Candidate attribute keys use the entity stable key plus the normalized
attribute name and, where needed, a source fingerprint.

Stable-key parts are lowercased and normalized to ASCII-like identity segments.
The helper functions in `railiance_fabric.discovery` define the initial rules.

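A minimal sketch of key construction under these rules is shown below. It is not the code in `railiance_fabric.discovery`: the function names, the choice of hyphens as the replacement character, the `discovery-edge:` prefix, and the truncated SHA-256 digest are all assumptions made for illustration.

```python
import hashlib
import re
import unicodedata


def normalize_segment(value: str) -> str:
    """Lowercase, fold to ASCII, and collapse other characters to hyphens."""
    folded = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", folded.lower()).strip("-")


def node_key(repo_slug, entity_kind, name, source_fingerprint=None):
    """Build a node key in the documented discovery:... shape."""
    parts = [
        "discovery",
        normalize_segment(repo_slug),
        normalize_segment(entity_kind),
        normalize_segment(name),
    ]
    if source_fingerprint:
        parts.append(source_fingerprint)
    return ":".join(parts)


def edge_key(source_key, edge_type, target_key, evidence_scope=""):
    """Fingerprint the relationship; prefix and digest length are hypothetical."""
    payload = "\x1f".join([source_key, edge_type, target_key, evidence_scope])
    return "discovery-edge:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


service = node_key("railiance-fabric", "Service", "Billing API")
route = node_key("railiance-fabric", "route", "GET /health", "a1b2")
```

Because the inputs are normalized before joining, `"Billing API"` and `"billing-api"` yield the same node key, which is what keeps repeated scans from appending near-duplicate nodes.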
## Source Anchors

Every candidate must carry one or more source anchors. A source anchor identifies
why the scanner believes the fact exists. Anchors can point to files, package
manifests, lockfiles, API contracts, deployment manifests, service catalogs,
registries, LLM evidence bundles, or manual review notes.

Source anchors include a fingerprint. The fingerprint should cover stable
location fields such as path, URL, ref, line range, or JSON pointer. Snippets are
useful for review but should not be the only identity anchor because formatting
noise can churn snippets.

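One way to keep snippets out of the fingerprint is to hash only the location fields. The sketch below assumes anchors are plain dictionaries and that the field names in `LOCATION_FIELDS` match the schema, which may not be the case.

```python
import hashlib
import json

# Assumed location field names; the schema may name them differently.
LOCATION_FIELDS = ("path", "url", "ref", "line_start", "line_end", "json_pointer")


def anchor_fingerprint(anchor: dict) -> str:
    """Hash only stable location fields, ignoring snippets."""
    location = {k: anchor[k] for k in LOCATION_FIELDS if anchor.get(k) is not None}
    # Sorted keys and fixed separators give a canonical byte representation.
    canonical = json.dumps(location, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


before = {"path": "pyproject.toml", "json_pointer": "/project/name", "snippet": 'name = "demo"'}
after = {"path": "pyproject.toml", "json_pointer": "/project/name", "snippet": "name='demo'"}
moved = {"path": "packages/api/pyproject.toml", "json_pointer": "/project/name"}
```

`before` and `after` differ only in snippet formatting and share a fingerprint, while `moved` points at a different path and gets a new one.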
## Replacement Scopes

A replacement scope says which extractor owns which set of candidates. Rescans
may retire missing candidates only inside the same scope.

Examples:

- `scope:repo-scoping:python-package:package_manifest:<hash>`
- `scope:state-hub:fabric-declarations:declaration`
- `scope:llm-connect:readme-summary:file:<hash>`
- `scope:railiance-fabric:local-registry:fabric_registry`

Scopes have a mode:

- `replacement`: candidates missing from the next run in the same scope become
  tombstones.
- `additive`: candidates are added or updated, but absence does not retire old
  candidates.

LLM extractors should use replacement mode only for tightly bounded evidence
bundles. Broad repo summaries are safer as additive or review-only until the
extraction prompts are proven stable.

## Merge Precedence

When multiple sources describe the same entity, reconciliation uses this
precedence:

1. `repo_declaration`
2. `deterministic`
3. `catalog`
4. `registry`
5. `llm`
6. `manual`

Manual review can override local candidate state, but it should not silently
rewrite repo-owned declarations. If accepted discoveries should become
authoritative, the safer next step is to generate a repo-owned declaration patch
for human review.

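The precedence order can be applied with a simple rank lookup. This is a sketch that assumes each candidate is a dictionary with an `origin` field holding one of the six values above; `pick_winner` is a hypothetical helper name.

```python
PRECEDENCE = ["repo_declaration", "deterministic", "catalog", "registry", "llm", "manual"]
RANK = {origin: position for position, origin in enumerate(PRECEDENCE)}


def pick_winner(candidates):
    """Return the candidate whose origin ranks highest in the precedence list."""
    # min() keeps the first candidate on ties, so input order breaks ties.
    return min(candidates, key=lambda candidate: RANK[candidate["origin"]])


conflicting = [
    {"origin": "llm", "label": "Billing Service"},
    {"origin": "deterministic", "label": "billing"},
    {"origin": "repo_declaration", "label": "Billing"},
]
winner = pick_winner(conflicting)
```

The repo-owned declaration wins in this example, and the losing candidates can still be attached as supporting evidence rather than discarded.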
## Duplicate Handling

The reconciler should merge candidates with the same stable key automatically.
It should also look for possible duplicates using:

- alias overlap
- identical source anchors
- identical evidence fingerprints
- normalized label similarity within the same entity kind
- relationship fingerprints with the same endpoints and edge type
- declaration ids that match discovery aliases

Exact stable-key matches can be merged automatically. Alias-only or
similarity-only matches should become `needs_review` conflicts unless an
extractor has a source-specific rule that makes the match deterministic.

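The split between automatic merges and review conflicts might look like the sketch below. It covers only two of the listed signals, alias overlap and label matching, and it simplifies "label similarity" to equality after normalization. The field names and return values are assumptions.

```python
def normalize_label(label: str) -> str:
    """Case-fold and collapse whitespace so cosmetic differences do not matter."""
    return " ".join(label.casefold().split())


def classify_match(a: dict, b: dict) -> str:
    """Return 'merge', 'needs_review', or 'distinct' for a pair of candidates."""
    if a["stable_key"] == b["stable_key"]:
        return "merge"
    alias_overlap = bool(set(a.get("aliases", [])) & set(b.get("aliases", [])))
    same_label = (
        a["kind"] == b["kind"]
        and normalize_label(a["label"]) == normalize_label(b["label"])
    )
    # Alias-only or label-only matches are never merged silently.
    if alias_overlap or same_label:
        return "needs_review"
    return "distinct"


declared = {"stable_key": "discovery:demo:service:billing", "kind": "service",
            "label": "Billing", "aliases": ["billing-api"]}
guessed = {"stable_key": "discovery:demo:service:billing-api", "kind": "service",
           "label": "Billing API", "aliases": ["billing-api"]}
library = {"stable_key": "discovery:demo:library:requests", "kind": "library",
           "label": "requests", "aliases": []}
```

`declared` and `guessed` share an alias but not a stable key, so the pair is routed to review instead of being merged.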
## Rescan And Tombstones

On a rescan, the scanner compares the previous accepted discovery snapshot with
the newly produced snapshot for the same repo/profile.

- Same stable key: update in place.
- Same source anchor but changed attributes: update with changed evidence.
- Missing from same replacement scope: create a tombstone.
- Missing from a different scope: leave untouched.
- Reappears after tombstone: reactivate if the stable key and scope match.
- Reappears with a new key but same alias/source anchor: flag as possible
  duplicate resurrection.

Tombstones explain graph drift and prevent immediate re-creation loops. They
should be retained long enough to compare several scan cycles and can later be
compacted by repo, extractor, or entity kind.

## Mapping To Fabric Graphs

Discovery candidates can project into the existing graph model when accepted:

- candidate service nodes map to `ServiceDeclaration`-like graph nodes
- candidate capabilities and interfaces map to provider surface nodes
- candidate dependencies map to dependency nodes and `consumes` edges
- candidate deployment/runtime entities map to graph explorer infrastructure
  nodes until declarations gain first-class runtime support
- candidate libraries map to library inventory records and graph explorer nodes

If a repo-owned declaration already exists for the same entity, discovery output
should attach as supporting evidence instead of creating another node.

## LLM Boundary

LLM extraction through `llm-connect` is optional and schema-gated. The scanner
should use deterministic preselection to build small evidence bundles, ask for
structured JSON, validate the JSON against the discovery schema, and record:

- extractor id and version
- prompt version
- provider and model
- usage metadata
- confidence and uncertainty
- rationale

Malformed, low-confidence, or conflicting LLM output becomes review material,
not accepted graph data.

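A minimal sketch of the gate between model output and the snapshot is shown below. It performs only a top-level shape check, not full validation against the discovery schema, and the `llm_malformed_json` and `llm_schema_mismatch` artifact kinds are hypothetical names.

```python
import json

EXPECTED_KEYS = ("nodes", "edges", "attributes")


def parse_llm_response(text: str):
    """Return (payload, None) on success or (None, review_artifact) on failure."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        # Hypothetical artifact kind for output that is not JSON at all.
        return None, {"kind": "llm_malformed_json", "detail": str(exc)}
    if not isinstance(data, dict) or any(
        not isinstance(data.get(key), list) for key in EXPECTED_KEYS
    ):
        # Hypothetical artifact kind for JSON with the wrong top-level shape.
        return None, {"kind": "llm_schema_mismatch", "detail": "expected nodes/edges/attributes lists"}
    return data, None


ok_payload, ok_artifact = parse_llm_response('{"nodes": [], "edges": [], "attributes": []}')
bad_payload, bad_artifact = parse_llm_response("Sure! Here is the JSON you asked for")
partial_payload, partial_artifact = parse_llm_response('{"nodes": []}')
```

Returning a review artifact instead of raising lets the scan finish and keeps the snapshot schema-valid even when the model misbehaves.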