# Repo Reality Scanner

The repo reality scanner discovers Fabric entities from repository evidence and
turns them into candidate graph facts. It is a discovery layer, not a new
authoring surface. Repo-owned declarations remain high-trust self-description
evidence, but financial Fabric ownership, tenant boundaries, and cross-boundary
utility relations must be resolved from accountability roots or review
decisions before they become accepted graph data.

## Contract

A scanner run emits a `FabricDiscoverySnapshot`. The snapshot is scoped to one
repository, one commit, and one scan profile. It contains:

- replacement scopes, which define the evidence sets that may be replaced on a
  rescan
- candidate nodes, edges, and attributes
- source anchors for every candidate
- extractor provenance for every candidate
- tombstones for candidates that vanished inside a replacement scope
- reconciliation policy metadata

The JSON schema lives at `schemas/discovery-snapshot.schema.yaml`.

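To make the contract concrete, here is a minimal Python sketch of the snapshot
shape. The class and field names are illustrative assumptions, not the real
schema; the schema file remains the authority. The helper shows the invariant
from the list above: every candidate carries source anchors and provenance.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    # Hypothetical field names; the real ones are defined by the schema file.
    stable_key: str
    kind: str  # "node", "edge", or "attribute"
    source_anchors: list = field(default_factory=list)
    provenance: list = field(default_factory=list)


@dataclass
class Snapshot:
    # A snapshot is scoped to one repository, one commit, and one profile.
    repo_slug: str
    commit: str
    profile: str
    replacement_scopes: list = field(default_factory=list)
    candidates: list = field(default_factory=list)
    tombstones: list = field(default_factory=list)
    reconciliation: dict = field(default_factory=dict)


def missing_evidence(snapshot):
    """Return stable keys of candidates lacking an anchor or provenance."""
    return [
        c.stable_key
        for c in snapshot.candidates
        if not c.source_anchors or not c.provenance
    ]
```

A check like `missing_evidence` could run before a snapshot is written, so a
candidate without evidence never leaves the scanner.
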
## Deterministic Scanner CLI

The first implementation slice adds an offline deterministic scan command:

```bash
railiance-fabric scan . \
  --repo-slug railiance-fabric \
  --commit "$(git rev-parse HEAD)" \
  --dry-run \
  --output discovery-snapshot.json
```

Use `--json` to print the full `FabricDiscoverySnapshot` to stdout. Without
`--json`, the command prints a concise summary of node, edge, attribute, and
replacement-scope counts. The scanner does not call registries, catalogs, or
LLMs in this mode; `--output` is the only write side effect.

The deterministic extractor framework currently covers:

- repository metadata from local git/path evidence
- README, INTENT, and SCOPE document presence and headings
- repo-owned Fabric declarations under `fabric/`
- Python `pyproject.toml` package metadata and dependencies
- Node `package.json` package metadata and dependencies
- common lockfiles such as `package-lock.json`, `poetry.lock`, and `uv.lock`
- Dockerfiles and Docker Compose services
- OpenAPI and AsyncAPI contract files
- Score workload files
- Kubernetes-style deployment manifests
- common service config files such as `application.yaml` and
  `appsettings.json`

Each extractor emits candidates through the same accumulator so stable-key
duplicates merge inside a scan before the snapshot is returned.

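The shared accumulator can be pictured as a dictionary keyed by stable key. The
sketch below is an assumption about the general shape, not the project's
accumulator: `CandidateAccumulator` and its methods are hypothetical names.

```python
class CandidateAccumulator:
    """Hypothetical accumulator: one entry per stable key within a scan."""

    def __init__(self):
        self._by_key = {}

    def add(self, stable_key, anchors, provenance):
        # The first emission creates the entry; later ones merge into it.
        entry = self._by_key.setdefault(
            stable_key,
            {"stable_key": stable_key, "source_anchors": [], "provenance": []},
        )
        for anchor in anchors:
            if anchor not in entry["source_anchors"]:
                entry["source_anchors"].append(anchor)
        for item in provenance:
            if item not in entry["provenance"]:
                entry["provenance"].append(item)

    def candidates(self):
        return list(self._by_key.values())
```

Two extractors that observe the same entity then produce one candidate that
carries both anchors, rather than two competing candidates.
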
## LLM-Assisted Extraction

LLM extraction is optional and explicit:

```bash
railiance-fabric scan . \
  --repo-slug railiance-fabric \
  --llm \
  --llm-provider openai \
  --llm-model gpt-4.1-mini \
  --dry-run \
  --output discovery-with-llm.json
```

The implementation integrates through `llm-connect` with `create_adapter` and
`RunConfig`. Tests use a `MockLLMAdapter`-compatible boundary so CI stays
offline. If `llm-connect` is unavailable, the provider call fails, or the model
returns malformed JSON, the scanner records a `review_artifacts` entry and keeps
the discovery snapshot schema-valid.

The LLM never receives the whole repository. The scanner first builds a compact
evidence bundle from deterministic candidates, prioritizing durable local
evidence such as Fabric declarations, services, capabilities, interfaces,
libraries, deployments, and small README/INTENT/SCOPE signals. The prompt asks
for strict JSON:

```json
{"nodes": [], "edges": [], "attributes": []}
```

Projected LLM candidates are always `origin: llm` and
`review_state: needs_review`. Candidates below the configured confidence
threshold become `llm_low_confidence` review artifacts instead of graph
candidates. Unresolved edge endpoints or attribute targets also become review
artifacts. Accepted graph data still requires deterministic evidence,
repo-owned declarations, or a later human review/acceptance path.

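The projection rule can be sketched as a small filter. This is an illustration
only: the function name, the `confidence` field, the artifact layout, and the
default threshold of `0.6` are assumptions, not values taken from the scanner.

```python
def project_llm_nodes(raw_nodes, threshold=0.6):
    """Split model output into graph candidates and review artifacts."""
    candidates, review_artifacts = [], []
    for node in raw_nodes:
        # A missing confidence is treated as zero, so it goes to review.
        confidence = float(node.get("confidence", 0.0))
        if confidence < threshold:
            review_artifacts.append(
                {"kind": "llm_low_confidence", "payload": node}
            )
            continue
        # Every projected candidate is forced into the review path.
        candidates.append(
            {**node, "origin": "llm", "review_state": "needs_review"}
        )
    return candidates, review_artifacts
```

Note that the model cannot promote its own output: even if it returned
`review_state: accepted`, the projection overwrites it.
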
## Reconciliation And Dry-Run Diffs

Scans can be reconciled against a previous discovery snapshot:

```bash
railiance-fabric scan . \
  --repo-slug railiance-fabric \
  --previous-snapshot previous-discovery.json \
  --dry-run \
  --output current-discovery.json
```

The reconciler writes `reconciliation.diff` with explicit stable-key sets:

- `added`
- `changed`
- `retired`
- `conflicted`

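As a rough illustration of how the first three sets relate, the sketch below
diffs two mappings from stable key to candidate payload. It is deliberately
simplified and hypothetical: it ignores replacement scopes (which decide
whether a missing candidate is really retired) and leaves `conflicted` empty.

```python
def diff_snapshots(previous, current):
    """Diff two dicts that map stable key -> candidate payload."""
    shared = previous.keys() & current.keys()
    return {
        "added": sorted(current.keys() - previous.keys()),
        "changed": sorted(k for k in shared if previous[k] != current[k]),
        # Simplification: a real reconciler checks the replacement scope first.
        "retired": sorted(previous.keys() - current.keys()),
        # Conflict detection is out of scope for this sketch.
        "conflicted": [],
    }
```
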
The reconciler deduplicates candidates by stable key, merges source anchors and
provenance, and applies source-aware precedence when duplicate candidates
disagree. The current precedence is:

1. `repo_declaration`
2. `deterministic`
3. `catalog`
4. `registry`
5. `llm`
6. `manual`

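One way to apply such an ordering is to rank each origin and keep the
best-ranked candidate. The sketch below assumes each candidate is a dict with
an `origin` field; `pick_winner` is a hypothetical helper, not the reconciler's
function.

```python
PRECEDENCE = [
    "repo_declaration",
    "deterministic",
    "catalog",
    "registry",
    "llm",
    "manual",
]
RANK = {origin: index for index, origin in enumerate(PRECEDENCE)}


def pick_winner(candidates):
    """Return the candidate whose origin ranks highest in PRECEDENCE."""
    # Unknown origins sort last; ties keep the earliest candidate.
    return min(candidates, key=lambda c: RANK.get(c["origin"], len(PRECEDENCE)))
```
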
Possible duplicates found through matching aliases, normalized labels,
relationship endpoints, or attribute targets are not silently merged. They are
marked `status: conflicted`, moved to `review_state: needs_review`, and listed
under `reconciliation.conflicts`.

Missing previous candidates become tombstones only when their replacement scope
is present in the current scan and has `mode: replacement`. Missing candidates
from additive scopes, such as broad LLM evidence bundles, are left alone.
Existing tombstones are preserved so repeated scans can explain graph drift.

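The tombstone rule reduces to two checks per previous candidate. The following
sketch assumes hypothetical shapes: previous candidates are dicts with
`stable_key` and `scope_id`, and the current scan exposes a mapping from scope
id to mode.

```python
def new_tombstones(previous, current_keys, current_scopes):
    """Return tombstones for candidates that vanished inside a replacement scope.

    previous: list of {"stable_key": ..., "scope_id": ...}
    current_keys: set of stable keys present in the current scan
    current_scopes: dict of scope id -> "replacement" or "additive"
    """
    tombstones = []
    for candidate in previous:
        if candidate["stable_key"] in current_keys:
            continue  # still observed, nothing to retire
        # An absent scope yields None, so the candidate is left alone.
        if current_scopes.get(candidate["scope_id"]) == "replacement":
            tombstones.append(
                {
                    "stable_key": candidate["stable_key"],
                    "scope_id": candidate["scope_id"],
                }
            )
    return tombstones
```

A candidate whose scope did not run at all in the current scan is never
retired, which is what keeps a partial scan from deleting unrelated evidence.
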
## Registry Review And Acceptance

Discovery snapshots can be stored in the Fabric registry for review:

```bash
railiance-fabric scan . \
  --repo-slug railiance-fabric \
  --previous-snapshot previous-discovery.json \
  --output discovery.json

railiance-fabric registry ingest-discovery discovery.json \
  --repo-slug railiance-fabric
```

The registry keeps discovery snapshots separately from accepted graph snapshots
by repo, commit, and scan profile. It exposes latest/list/diff API routes so a
dry run can be reviewed without changing the accepted graph.

Accepted discovery can be projected into a normal graph snapshot:

```bash
railiance-fabric registry accept-discovery railiance-fabric 12 \
  --accepted-key discovery:railiance-fabric:service-declaration:example
```

By default, the accept path only projects candidates already marked
`review_state: accepted`. Passing `--accepted-key` explicitly includes selected
candidate stable keys. Existing accepted graph nodes win over discovery nodes
with the same graph id, so repo-owned declarations are preserved. Projected
nodes carry discovery stable key, origin, review state, confidence, provenance,
and source anchors in graph attributes; the graph explorer payload exposes
those fields for review. For financial Fabric fields, accepted discovery still
needs an accountability root, a baseline inheritance rule, or an explicit
review decision.

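The selection and "existing nodes win" rules can be sketched in a few lines.
The function names and the `graph_id` field are assumptions made for this
example; they are not the registry's API.

```python
def select_for_projection(candidates, accepted_keys=()):
    """Pick candidates that are accepted or explicitly listed by stable key."""
    explicit = set(accepted_keys)
    return [
        c
        for c in candidates
        if c["review_state"] == "accepted" or c["stable_key"] in explicit
    ]


def project(selected, accepted_graph):
    """Merge selected candidates into a graph dict keyed by graph id."""
    merged = dict(accepted_graph)
    for candidate in selected:
        # setdefault keeps any node that is already in the accepted graph.
        merged.setdefault(candidate["graph_id"], candidate)
    return merged
```
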
## Connector Follow-Up

Connector follow-up is explicit and separated from repo-local extraction:

```bash
railiance-fabric scan . \
  --repo-slug railiance-fabric \
  --connector local-fabric-registry \
  --connector-manifest registry/local-repos.yaml \
  --dry-run
```

The connector interface has slots for:

- package registries
- container registries
- API catalogs
- service catalogs
- deployment inventories
- existing Fabric registry data

The first implementation is `local-fabric-registry`, an offline-safe connector
that reads a local onboarding manifest such as `registry/local-repos.yaml`. It
adds a `FabricRegistryEntry` candidate, a `cataloged_as` edge from the
repository node, and registry-sourced attributes such as domain, remote URL,
default branch, State Hub repo id, and declaration paths.

Connector evidence uses its own replacement scope with source kind
`fabric_registry`, so rescans can replace catalog facts without retiring
repo-local evidence. Connector run metadata is recorded under `connector_runs`
with status, source, message, and candidate counts.

Connector-derived facts should be treated this way:

- accepted: only when the connector reads explicit repo-owned evidence,
  accountability-root evidence, or a catalog already governed as authoritative
  for that field
- candidate: stable local registry facts such as onboarding manifest entries,
  declared remote URLs, State Hub ids, and declaration paths
- review-only: missing catalogs, rate limits, connector failures, ambiguous
  matches, or facts from catalogs with unclear ownership

Failures do not corrupt the scan. Missing catalogs become
`connector_unavailable` review artifacts, malformed catalogs become
`connector_failed` artifacts, and future remote connectors should use
`connector_rate_limited` when backoff is required.

## Multi-Repo Orchestration

Known local repos can be scanned from the same onboarding manifest used by
`registry sync-manifest`:

```bash
railiance-fabric registry scan-manifest registry/local-repos.yaml \
  --dry-run \
  --output-dir .fabric-discovery
```

The command isolates each repo. A missing path, invalid previous snapshot, or
registry write failure is reported for that repo without aborting the rest of
the run. The summary includes repo counts for scanned, changed, retired,
conflicted, LLM skipped, LLM failed, ingested, accepted, and errors so it can be
copied into State Hub progress notes or future automation output.

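Per-repo isolation amounts to catching failures inside the loop rather than
around it. The sketch below is a generic illustration: `scan_all` and the
`scan_one` callback are hypothetical, and the summary tracks only two of the
counters listed above.

```python
def scan_all(repos, scan_one):
    """Scan each repo independently and collect a run summary.

    repos: dict of repo slug -> local path
    scan_one: callable that takes a path and returns a snapshot, or raises
    """
    results = {}
    summary = {"scanned": 0, "errors": 0}
    for slug, path in repos.items():
        try:
            results[slug] = {"status": "ok", "snapshot": scan_one(path)}
            summary["scanned"] += 1
        except Exception as exc:  # one bad repo must not abort the others
            results[slug] = {"status": "error", "message": str(exc)}
            summary["errors"] += 1
    return results, summary
```
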
Useful controls:

- `--repo-slug <slug>` can be repeated to scan an allowlist.
- `--profile <name>` tags the scan profile and output filename.
- `--previous-dir <dir>` reconciles each repo against
  `<slug>-<profile>.discovery.json` from an earlier run.
- `--llm` enables LLM-assisted extraction; `--deterministic-only` forces the
  offline rule path.
- `--llm-max-runs <n>` caps how many repos may attempt LLM extraction in one
  orchestration run, while `--llm-max-tokens` remains the per-repo request cap.
- `--connector local-fabric-registry` attaches manifest-derived registry facts
  to every repo scan.
- `--ingest` stores discovery snapshots in the registry; `--accept` then
  projects accepted candidates into graph snapshots. `--dry-run` suppresses
  registry writes even when those flags are present.

Example review cycle:

```bash
railiance-fabric registry scan-manifest registry/local-repos.yaml \
  --repo-slug railiance-fabric \
  --previous-dir .fabric-discovery \
  --output-dir .fabric-discovery \
  --connector local-fabric-registry \
  --dry-run
```

After review, rerun with `--ingest` to store the snapshots. Add `--accept` only
when candidates marked `review_state: accepted` should be projected into the
registry graph.

For repeated operational loops, including default cache paths, registry-backed
previous snapshots, run reports, exit codes, and rescan health views, see
`docs/operational-rescan-loops.md`.

## Scan Profiles And Review Workflow

The initial profile is `deterministic`, which means repo-local extraction plus
any explicitly enabled offline connectors. Additional profiles should be named
for the evidence policy they represent, for example `deterministic-llm-draft`
or `catalog-followup`. Keep profile names stable because per-repo previous
snapshots use `<slug>-<profile>.discovery.json`.

Recommended workflow:

1. Run `scan` or `registry scan-manifest` with `--dry-run`.
2. Reconcile with `--previous-snapshot` or `--previous-dir` when a prior
   snapshot exists.
3. Review candidates with `review_state: needs_review`, `status: conflicted`,
   tombstones, and review artifacts before accepting anything.
4. Store reviewed output with `registry ingest-discovery`.
5. Use `registry accept-discovery` or `registry scan-manifest --ingest --accept`
   only for candidates whose review state is acceptable for projection.

## Failure Modes

Failures are captured close to the evidence source:

- Missing repo paths, invalid manifest entries, unreadable previous snapshots,
  and registry request failures mark that repo as `status: error` in
  `scan-manifest` without stopping other repos.
- Connector failures become review artifacts such as `connector_unavailable` or
  `connector_failed`.
- LLM provider failures and malformed model output become `llm_execution_error`
  or `llm_output_invalid` review artifacts.
- Low-confidence LLM candidates become `llm_low_confidence` artifacts instead
  of graph candidates.
- Possible duplicates are marked as conflicts and left for review instead of
  being silently merged.

## Rollout Dry Run

The first small local rollout ran on 2026-05-19:

```bash
railiance-fabric registry scan-manifest registry/local-repos.yaml \
  --repo-slug repo-scoping \
  --repo-slug llm-connect \
  --repo-slug railiance-fabric \
  --dry-run \
  --connector local-fabric-registry
```

Result:

- `repo-scoping`: 18 nodes, 17 edges, 13 attributes
- `llm-connect`: 5 nodes, 4 edges, 13 attributes
- `railiance-fabric`: 55 nodes, 63 edges, 13 attributes
- summary: 3 scanned, 0 changed, 0 retired, 0 conflicted, 3 LLM skipped,
  0 LLM failed, 0 accepted, 0 errors

Follow-up backlog from this first pass:

- Add a standard discovery snapshot directory, likely `.fabric-discovery/`, so
  repeated dry-runs can reconcile by default.
- Add a previous-from-registry option so manifest scans can diff against the
  latest stored discovery snapshot without exporting JSON first.
- Expand runtime/deployment extraction beyond local manifests to cover live
  server and deployment inventory connectors.
- Add review UI affordances for conflicts, tombstones, and bulk acceptance once
  enough repos have baseline snapshots.
- Define privacy and budget defaults before enabling non-mock LLM providers in
  multi-repo scans.

## Identity

Identity is the main safety boundary. The scanner must not append guesses on
every run. It needs to produce stable keys that are repeatable for the same
observed entity.

Candidate node keys use this shape:

```text
discovery:{repo_slug}:{entity_kind}:{normalized_name}[:source_fingerprint]
```

Use the optional source fingerprint when a name is too generic or when multiple
entities of the same kind can share a display name. Examples include HTTP
routes, generated clients, deployment manifests, and catalog records.

Candidate edge keys use a relationship fingerprint over:

- source stable key
- edge type
- target stable key
- optional evidence scope

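A relationship fingerprint over those fields could be computed as a truncated
hash. Everything specific here is an assumption for illustration: the
`edge_key` name, the `discovery-edge:` prefix, the unit-separator delimiter,
the choice of SHA-256, and the 16-character truncation.

```python
import hashlib


def edge_key(source_key, edge_type, target_key, evidence_scope=None):
    """Build a repeatable edge key from its endpoints and type."""
    parts = [source_key, edge_type, target_key]
    if evidence_scope is not None:
        parts.append(evidence_scope)
    # A control character as delimiter avoids collisions such as
    # ("a:b", "c") versus ("a", "b:c"), since stable keys contain colons.
    joined = "\x1f".join(parts)
    digest = hashlib.sha256(joined.encode("utf-8")).hexdigest()[:16]
    return f"discovery-edge:{digest}"
```

Because the inputs are the endpoint stable keys rather than display names, the
edge key stays the same across runs as long as both endpoints do.
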
Candidate attribute keys use the entity stable key plus the normalized
attribute name and, where needed, a source fingerprint.

Stable-key parts are lowercased and normalized to ASCII-like identity segments.
The helper functions in `railiance_fabric.discovery` define the initial rules.

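The exact rules live in `railiance_fabric.discovery`; the sketch below is not
that code. It shows one plausible normalization (Unicode decomposition,
dropping non-ASCII characters, lowercasing, collapsing other characters to
hyphens) with hypothetical function names.

```python
import re
import unicodedata


def normalize_segment(value):
    """Lowercase a value and reduce it to an ASCII identity segment."""
    decomposed = unicodedata.normalize("NFKD", value)
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of non-alphanumeric characters into one hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def node_key(repo_slug, entity_kind, name, source_fingerprint=None):
    """Assemble a node key in the documented colon-separated shape."""
    parts = [
        "discovery",
        normalize_segment(repo_slug),
        normalize_segment(entity_kind),
        normalize_segment(name),
    ]
    if source_fingerprint:
        parts.append(source_fingerprint)
    return ":".join(parts)
```

With these rules, `node_key("Railiance-Fabric", "Service", "Café API")` yields
`discovery:railiance-fabric:service:cafe-api`, so cosmetic changes in case or
spacing do not mint a new identity.
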
## Source Anchors

Every candidate must carry one or more source anchors. A source anchor identifies
why the scanner believes the fact exists. Anchors can point to files, package
manifests, lockfiles, API contracts, deployment manifests, service catalogs,
registries, LLM evidence bundles, or manual review notes.

Source anchors include a fingerprint. The fingerprint should cover stable
location fields such as path, URL, ref, line range, or JSON pointer. Snippets are
useful for review but should not be the only identity anchor because formatting
noise can churn snippets.

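A fingerprint that covers only location fields might look like the following.
The field names, the hash choice, and the truncation length are assumptions
for this sketch; the point is that the snippet is left out of the hashed
payload.

```python
import hashlib
import json

# Hypothetical names for the stable location fields mentioned above.
LOCATION_FIELDS = ("path", "url", "ref", "line_range", "json_pointer")


def anchor_fingerprint(anchor):
    """Hash only the location fields of an anchor, ignoring its snippet."""
    location = {
        name: anchor[name]
        for name in LOCATION_FIELDS
        if anchor.get(name) is not None
    }
    # Sorted keys and compact separators make the encoding canonical.
    encoded = json.dumps(location, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()[:16]
```

Two anchors that point at the same path and JSON pointer then share a
fingerprint even if whitespace changes alter the captured snippet.
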
## Replacement Scopes

A replacement scope says which extractor owns which set of candidates. Rescans
may retire missing candidates only inside the same scope.

Examples:

- `scope:repo-scoping:python-package:package_manifest:<hash>`
- `scope:state-hub:fabric-declarations:declaration`
- `scope:llm-connect:readme-summary:file:<hash>`
- `scope:railiance-fabric:local-registry:fabric_registry`

Scopes have a mode:

- `replacement`: candidates missing from the next run in the same scope become
  tombstones.
- `additive`: candidates are added or updated, but absence does not retire old
  candidates.

LLM extractors should use replacement mode only for tightly bounded evidence
bundles. Broad repo summaries are safer as additive or review-only until the
extraction prompts are proven stable.

## Merge Precedence

When multiple sources describe the same entity, reconciliation uses this
precedence:

1. `repo_declaration`
2. `deterministic`
3. `catalog`
4. `registry`
5. `llm`
6. `manual`

Manual review can override local candidate state, but it should not silently
rewrite repo-owned declarations. If accepted discoveries should become durable
repo-local evidence, generate a repo-owned declaration patch for human review.
If they affect financial ownership, fabric containment, tenancy, or utility
value boundaries, generate a baseline or accountability-root review item
instead.

## Duplicate Handling

The reconciler should merge candidates with the same stable key automatically.
It should also look for possible duplicates using:

- alias overlap
- identical source anchors
- identical evidence fingerprints
- normalized label similarity within the same entity kind
- relationship fingerprints with the same endpoints and edge type
- declaration ids that match discovery aliases

Exact stable-key matches can be merged automatically. Alias-only or
similarity-only matches should become `needs_review` conflicts unless an
extractor has a source-specific rule that makes the match deterministic.

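Two of the signals above, alias overlap and normalized label match within the
same kind, are enough to sketch the detection step. The candidate shape and
the `possible_duplicates` helper are hypothetical, and exact label equality
stands in for whatever similarity measure the reconciler uses.

```python
def possible_duplicates(candidates):
    """Return pairs of stable keys that look like the same entity."""
    pairs = []
    for index, first in enumerate(candidates):
        for second in candidates[index + 1:]:
            if first["stable_key"] == second["stable_key"]:
                continue  # exact matches are merged elsewhere, not flagged
            shared_aliases = set(first.get("aliases", ())) & set(
                second.get("aliases", ())
            )
            same_label = (
                first["kind"] == second["kind"]
                and first["label"].strip().lower()
                == second["label"].strip().lower()
            )
            if shared_aliases or same_label:
                pairs.append((first["stable_key"], second["stable_key"]))
    return pairs
```

Each returned pair would be marked as a `needs_review` conflict rather than
merged, in line with the rule stated above.
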
## Rescan And Tombstones

On a rescan, the scanner compares the previous accepted discovery snapshot with
the newly produced snapshot for the same repo/profile.

- Same stable key: update in place.
- Same source anchor but changed attributes: update with changed evidence.
- Missing from same replacement scope: create a tombstone.
- Missing from a different scope: leave untouched.
- Reappears after tombstone: reactivate if the stable key and scope match.
- Reappears with a new key but same alias/source anchor: flag as possible
  duplicate resurrection.

Tombstones explain graph drift and prevent immediate re-creation loops. They
should be retained long enough to compare several scan cycles and can later be
compacted by repo, extractor, or entity kind.

## Mapping To Fabric Graphs

Discovery candidates can project into the existing graph model when accepted:

- candidate service nodes map to `ServiceDeclaration`-like graph nodes
- candidate capabilities and interfaces map to provider surface nodes
- candidate dependencies map to dependency nodes and `consumes` edges
- candidate deployment/runtime entities map to graph explorer infrastructure
  nodes until declarations gain first-class runtime support
- candidate libraries map to library inventory records and graph explorer nodes

If a repo-owned declaration already exists for the same entity, discovery output
should attach as supporting evidence instead of creating another node.

## LLM Boundary

LLM extraction through `llm-connect` is optional and schema-gated. The scanner
should use deterministic preselection to build small evidence bundles, ask for
structured JSON, validate the JSON against the discovery schema, and record:

- extractor id and version
- prompt version
- provider and model
- usage metadata
- confidence and uncertainty
- rationale

Malformed, low-confidence, or conflicting LLM output becomes review material,
not accepted graph data.