Add repo reality scanner workplan

2026-05-19 02:26:02 +02:00
parent edf2062a7a
commit 5523b8f493

@@ -0,0 +1,334 @@
---
id: RAIL-FAB-WP-0010
type: workplan
title: "Repo Reality Scanner"
domain: railiance
repo: railiance-fabric
status: proposed
owner: codex
topic_slug: railiance
planning_priority: high
planning_order: 10
created: "2026-05-19"
updated: "2026-05-19"
---
# RAIL-FAB-WP-0010 - Repo Reality Scanner
## Goal
Create a facility that scans or rescans repositories and turns observed repo
reality into Fabric graph entities: nodes, edges, and attributes. The scanner
should combine deterministic extraction rules with LLM-assisted interpretation
through `llm-connect`, so the Fabric graph can become more accurate over time
without requiring every repository to hand-author perfect declarations first.

The facility must be safe for repeated rescans. It should avoid duplicate
entities, update changed entities, and retire entities that are no longer
observable. The result should be a disciplined discovery pipeline that can
follow source files, manifests, service catalogs, package registries, API
contracts, deployment descriptors, and other systematic sources before asking
an LLM to infer higher-level structure.
## Background
Railiance Fabric currently treats repo-owned declarations as the authoritative
graph source and the registry as a read/index layer. That remains the desired
direction for high-confidence data. This workplan adds a discovery layer that
can propose, maintain, and explain graph facts gathered from available repos.

The scanner should not create untraceable graph magic. Every inferred node,
edge, and attribute needs source evidence, confidence, extraction method,
stable identity, and replacement semantics. Deterministic rules should win
where possible; LLM extraction should fill gaps, classify ambiguous evidence,
and explain candidate relationships using structured outputs.

`llm-connect` provides a local Python integration surface for this:
- `create_adapter(provider, model=...)`
- `RunConfig`
- `MockLLMAdapter` for offline tests
- provider resolution via TOML configuration
- typed LLM errors and usage metadata
## Design Principles
- Prefer systematic evidence before LLM inference.
- Preserve repo-owned declarations as higher-trust data than discovery output.
- Make stable entity identity explicit and deterministic.
- Treat rescans as replacement runs over a scoped source set, not as append-only
guesses.
- Keep duplicates from entering the graph by matching stable keys, aliases,
source anchors, normalized names, and declared relationships.
- Track stale or vanished evidence with tombstones or retired candidates rather
than silently leaving old entities active.
- Make every candidate explainable: source path, extractor, prompt version,
evidence snippets or structured references, confidence, and review state.
- Keep LLM prompts small, auditable, testable, and replaceable.
- Support dry-run diffs before accepting discovery changes into the registry.
## Proposed Architecture
The scanner should produce a `FabricDiscoverySnapshot` for a repository and
commit. A snapshot is a normalized candidate graph plus evidence metadata:
- candidate nodes
- candidate edges
- candidate attributes
- source anchors
- extraction provenance
- identity keys and aliases
- confidence/review state
- replacement scope
- tombstones for previously observed candidates no longer present
Accepted scanner output can then be projected into existing `FabricGraphExport`
snapshots, graph explorer payloads, and registry drift APIs. Where repo-owned
declarations already exist, discovered entities should be matched and attached
as supporting evidence rather than duplicated.
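
The snapshot contents listed above could be modelled roughly as follows. This is a minimal sketch, not a settled schema: only the name `FabricDiscoverySnapshot` comes from this workplan, and every other class name, field name, and default is an illustrative assumption to be replaced by the T01 contract.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class SourceAnchor:
    """Where a fact was observed (hypothetical shape)."""
    path: str
    line: Optional[int] = None


@dataclass
class CandidateNode:
    stable_key: str
    kind: str
    name: str
    aliases: set[str] = field(default_factory=set)
    attributes: dict[str, str] = field(default_factory=dict)
    anchors: list[SourceAnchor] = field(default_factory=list)
    extractor_id: str = "unknown"
    confidence: float = 0.0
    review_state: str = "candidate"  # assumed states: candidate | review | accepted


@dataclass
class CandidateEdge:
    source_key: str
    target_key: str
    edge_type: str
    extractor_id: str = "unknown"
    confidence: float = 0.0


@dataclass
class FabricDiscoverySnapshot:
    repo_slug: str
    commit: str
    nodes: list[CandidateNode] = field(default_factory=list)
    edges: list[CandidateEdge] = field(default_factory=list)
    # Stable keys of previously observed candidates that are no longer present.
    tombstones: set[str] = field(default_factory=set)
    # Extractor ids whose earlier output this snapshot replaces on rescan.
    replacement_scope: set[str] = field(default_factory=set)
```

Keeping `review_state` defaulted to `"candidate"` means nothing becomes accepted graph data unless a later step promotes it explicitly.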
## Identity And Rescan Model
The hardest problem is identity. Each entity needs:
- a canonical stable key derived from repo slug, entity kind, normalized name,
and source-specific anchors when available
- an alias set for names found in code, manifests, package metadata, route
names, catalog ids, URLs, container names, deployment names, and service ids
- evidence fingerprints for source fragments that explain why the entity
exists
- relationship keys based on normalized source id, target id, edge type, and
evidence scope
- a replacement scope that says which extractor owns which candidates for a
repo/commit
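
As an illustration of the canonical stable key, the sketch below derives one from repo slug, entity kind, normalized name, and an optional anchor. The function names, the normalization rule, and the key layout are assumptions for discussion; the actual rules are to be defined in T01.

```python
import hashlib
import re
from typing import Optional


def normalize_name(name: str) -> str:
    """Lowercase and collapse runs of non-alphanumerics into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def stable_key(repo_slug: str, kind: str, name: str,
               anchor: Optional[str] = None) -> str:
    """Build a deterministic key (hypothetical layout).

    `anchor` should be a source-specific identifier that survives edits,
    such as a catalog id, not a line number.
    """
    parts = [repo_slug.lower(), kind.lower(), normalize_name(name)]
    if anchor:
        parts.append(anchor)
    # A separator that cannot appear in the parts keeps the hash unambiguous.
    digest = hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()[:16]
    return f"{parts[0]}:{parts[1]}:{parts[2]}:{digest}"
```

Because the key depends only on normalized inputs, `"Billing API"` and `"billing-api"` in the same repo and kind resolve to the same entity, which is what lets a rescan update in place instead of duplicating.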
On rescan:
- candidates with the same stable key are updated in place
- candidates matching only by alias/fingerprint are merged conservatively and
flagged for review
- candidates missing from the same replacement scope become retired/tombstoned
- repo-owned declarations override discovery candidates where they conflict
- unreviewed LLM-only candidates remain in candidate/review state and are not
treated as accepted authoritative declarations
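
The first and third rescan rules can be sketched as a pure diff over candidates keyed by stable key. The dictionary shape and the `extractor_id` field are assumptions; alias/fingerprint merging and declaration override are deliberately left out of this sketch.

```python
def rescan_diff(previous: dict[str, dict], current: dict[str, dict],
                scope: set[str]) -> dict[str, list[str]]:
    """Compare two scans keyed by stable key.

    `scope` is the set of extractor ids that ran in the current scan; only
    candidates owned by those extractors may be retired.
    """
    added = sorted(k for k in current if k not in previous)
    changed = sorted(
        k for k in current if k in previous and current[k] != previous[k]
    )
    retired = sorted(
        k for k, cand in previous.items()
        if k not in current and cand["extractor_id"] in scope
    )
    return {"added": added, "changed": changed, "retired": retired}
```

A candidate produced by an extractor that did not run in this scan is left untouched, so a deterministic-only rescan never retires earlier LLM or connector candidates.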
## Tasks
### T01 - Discovery Contract And Identity Model
```task
id: RAIL-FAB-WP-0010-T01
status: todo
priority: high
```
Define the discovery snapshot contract, identity keys, alias handling,
replacement scopes, provenance fields, confidence/review states, and how
discovered facts map into existing Fabric graph exports.
Acceptance notes:
- Add schema/docs for `FabricDiscoverySnapshot`, candidate nodes, candidate
edges, candidate attributes, source anchors, extractor provenance, and
tombstones.
- Define stable-key generation rules for services, libraries, interfaces,
dependencies, deployment/runtime entities, registries/catalog entries, and
generic discovered entities.
- Define merge precedence between repo-owned declarations, deterministic
discoveries, registry/catalog discoveries, and LLM discoveries.
- Define how duplicate candidates are detected and how uncertain matches are
flagged for human review.
- Define rescan replacement semantics so stale entities are retired only within
the source scope that produced them.
### T02 - Deterministic Repo Extractor Framework
```task
id: RAIL-FAB-WP-0010-T02
status: todo
priority: high
```
Implement the first deterministic extraction framework for repo-local evidence
before involving LLMs.
Acceptance notes:
- Add a scanner module and CLI surface that accepts repo path, repo slug,
commit, scan profile, and dry-run/output options.
- Extract systematic evidence from known file families such as package
manifests, lockfiles, Docker/Compose files, OpenAPI/AsyncAPI files, Score
files, Helm/Kubernetes manifests, service config files, existing `fabric/`
declarations, README/INTENT/SCOPE metadata, and repository metadata.
- Emit normalized candidate entities with source anchors and extractor ids.
- Keep extractor rules independently testable with fixtures.
- Avoid network access by default unless a connector profile explicitly enables
registry/catalog lookups.
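
One way to keep extractor rules independently testable is to write each rule as a pure function over a file's path and text, with a small dispatcher that routes files by name. The sketch below assumes that design; the rule registry, candidate dictionary shape, and extractor id are hypothetical, and only the `package.json` rule is shown.

```python
import json
from pathlib import PurePosixPath


def extract_package_json(path: str, text: str) -> list[dict]:
    """Emit a library candidate and its dependencies from a package.json."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return []  # unreadable manifests yield no candidates
    if not isinstance(data, dict):
        return []
    base = {"source_path": path, "extractor_id": "package-json"}
    candidates = []
    if "name" in data:
        candidates.append({"kind": "library", "name": data["name"], **base})
    deps = data.get("dependencies") or {}
    if isinstance(deps, dict):
        for dep in sorted(deps):
            candidates.append({"kind": "dependency", "name": dep, **base})
    return candidates


# File name -> rule. New file families register here (assumed convention).
RULES = {"package.json": extract_package_json}


def run_extractors(files: dict[str, str]) -> list[dict]:
    """Apply matching rules to a mapping of repo-relative path -> file text."""
    candidates = []
    for path in sorted(files):
        rule = RULES.get(PurePosixPath(path).name)
        if rule is not None:
            candidates.extend(rule(path, files[path]))
    return candidates
```

Since rules never touch the filesystem or network, a fixture is just a dictionary of paths to strings.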
### T03 - LLM-Assisted Extraction Through llm-connect
```task
id: RAIL-FAB-WP-0010-T03
status: todo
priority: high
```
Add LLM-assisted extraction for ambiguous repo evidence using `llm-connect`
with structured, auditable outputs.
Acceptance notes:
- Integrate `llm-connect` through a small adapter boundary using
`create_adapter`, `RunConfig`, provider/model configuration, and
`MockLLMAdapter`-based tests.
- Use deterministic preselection to choose small evidence bundles for the LLM;
do not send whole repos blindly.
- Require structured JSON output with candidate nodes, edges, attributes,
evidence references, confidence, uncertainty, and rationale.
- Validate LLM output against the discovery schema before it can enter a
snapshot.
- Record prompt version, model, provider, usage metadata, and extraction run id.
- Fail closed: malformed or low-confidence LLM output becomes a review artifact,
not accepted graph data.
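
The fail-closed rule can be illustrated without touching `llm-connect` at all: the sketch below only triages the raw text a model returned. The required field set, the confidence threshold, and the function name are assumptions, and a real implementation would validate against the discovery schema rather than this hand-rolled check.

```python
import json

REQUIRED_NODE_FIELDS = {"kind", "name", "confidence", "evidence"}  # assumed
MIN_CONFIDENCE = 0.6  # assumed threshold


def triage_llm_output(raw: str) -> tuple[list[dict], list[dict]]:
    """Split raw model output into (valid candidates, review artifacts)."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [], [{"reason": f"malformed JSON: {exc.msg}", "raw": raw}]
    if not isinstance(payload, dict) or not isinstance(payload.get("nodes"), list):
        return [], [{"reason": "missing 'nodes' list", "raw": raw}]

    valid, review = [], []
    for node in payload["nodes"]:
        if not isinstance(node, dict) or not REQUIRED_NODE_FIELDS.issubset(node):
            review.append({"reason": "schema violation", "node": node})
        elif (not isinstance(node["confidence"], (int, float))
              or node["confidence"] < MIN_CONFIDENCE):
            review.append({"reason": "low confidence", "node": node})
        else:
            valid.append(node)
    return valid, review
```

Nodes in the first list are still only candidates; passing this check makes them eligible to enter a snapshot, not accepted graph data.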
### T04 - Reconciliation, Deduplication, And Tombstones
```task
id: RAIL-FAB-WP-0010-T04
status: todo
priority: high
```
Build the reconciliation engine that merges deterministic, catalog, declaration,
and LLM candidates into a coherent repo discovery snapshot.
Acceptance notes:
- Match candidates by stable key, source anchor, alias, normalized labels, and
relationship fingerprints.
- Merge attributes with source-aware precedence and conflict reporting.
- Prevent duplicate nodes/edges from entering accepted scanner output.
- Produce explicit added/changed/retired/conflicted candidate sets for dry-run
review.
- Retire vanished candidates only inside their extractor replacement scope.
- Preserve historical tombstones long enough to explain graph drift and avoid
immediate re-creation loops.
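
Source-aware attribute merging with conflict reporting might look like the sketch below. The precedence order shown (declarations first, LLM last) is an assumption pending the T01 decision, and the source labels and return shape are hypothetical.

```python
# Highest trust first. The exact ordering is an assumption pending T01.
PRECEDENCE = ["declaration", "deterministic", "catalog", "llm"]


def merge_attributes(
    observations: list[tuple[str, dict]],
) -> tuple[dict, list[dict]]:
    """Merge (source, attributes) pairs; higher-trust sources win.

    Returns the merged attributes and a list of conflicts where a
    lower-trust source disagreed with the value that was kept.
    """
    rank = {source: i for i, source in enumerate(PRECEDENCE)}
    kept: dict[str, tuple[str, object]] = {}
    conflicts: list[dict] = []
    # sorted() is stable, so equal-trust observations keep their input order.
    for source, attrs in sorted(observations, key=lambda o: rank[o[0]]):
        for key, value in attrs.items():
            if key not in kept:
                kept[key] = (source, value)
            elif kept[key][1] != value:
                conflicts.append({
                    "attribute": key,
                    "kept": kept[key],
                    "rejected": (source, value),
                })
    merged = {key: value for key, (_, value) in kept.items()}
    return merged, conflicts
```

Reporting the rejected value alongside the kept one gives the dry-run review enough context to explain why a discovery did not override a declaration.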
### T05 - Registry And Catalog Follow-Up Connectors
```task
id: RAIL-FAB-WP-0010-T05
status: todo
priority: medium
```
Add connector slots for systematic follow-up against registries and catalogs
where a repo points to more authoritative metadata.
Acceptance notes:
- Define connector interfaces for package registries, container registries,
API catalogs, service catalogs, deployment inventories, and existing Fabric
registry data.
- Add at least one offline-safe prototype connector using local registry data or
fixtures.
- Keep connector evidence separate from repo-local evidence in replacement
scopes.
- Represent connector failures, rate limits, and unavailable catalogs without
corrupting repo-local scan results.
- Document when connector-derived facts should be marked accepted, candidate,
or review-only.
### T06 - Registry Integration And Dry-Run Review
```task
id: RAIL-FAB-WP-0010-T06
status: todo
priority: high
```
Integrate discovery snapshots with the Fabric registry so scans can be reviewed,
accepted, and reflected in graph queries without losing provenance.
Acceptance notes:
- Add storage/API support for latest discovery snapshots per repo/commit/profile.
- Add dry-run diff output that shows candidate additions, changes, retirements,
conflicts, duplicate merges, and confidence changes.
- Add an accept path that projects accepted discovery output into the combined
graph without overwriting repo-owned declarations.
- Expose discovery provenance and review state through inventory, graph, drift,
and graph explorer payloads.
- Preserve existing registry snapshot replacement semantics for accepted graph
exports.
### T07 - Multi-Repo Scan Orchestration
```task
id: RAIL-FAB-WP-0010-T07
status: todo
priority: medium
```
Add orchestration so the scanner can run across the known local repo manifests
and improve Fabric coverage over time.
Acceptance notes:
- Extend or complement `registry sync-manifest` with scan/rescan commands.
- Support repo allowlists, scan profiles, max LLM budget, deterministic-only
mode, and dry-run mode.
- Produce a concise summary across repos: scanned, changed, retired, conflicted,
LLM skipped, LLM failed, and accepted.
- Keep one repo failure from aborting the entire multi-repo run.
- Record enough metadata for State Hub progress notes and future automation.
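
Per-repo failure isolation and the cross-repo summary could be structured as in the sketch below. The `scan_repo` callable and its outcome strings are placeholders for whatever the real scanner returns; nothing here reflects an existing Fabric CLI.

```python
from collections import Counter
from typing import Callable


def scan_all(
    repos: list[str],
    scan_repo: Callable[[str], str],
) -> tuple[Counter, dict[str, str]]:
    """Scan each repo, tallying outcomes and isolating failures.

    `scan_repo` is a hypothetical per-repo scanner returning an outcome
    label such as "changed" or "unchanged".
    """
    summary: Counter = Counter()
    failures: dict[str, str] = {}
    for repo in repos:
        try:
            outcome = scan_repo(repo)
        except Exception as exc:  # one bad repo must not abort the run
            failures[repo] = f"{type(exc).__name__}: {exc}"
            summary["failed"] += 1
            continue
        summary["scanned"] += 1
        summary[outcome] += 1
    return summary, failures
```

Returning the failure messages separately from the counts keeps the summary concise while still leaving enough detail for progress notes.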
### T08 - Tests, Fixtures, Documentation, And Rollout
```task
id: RAIL-FAB-WP-0010-T08
status: todo
priority: medium
```
Harden the scanner with fixture coverage and clear adoption guidance before
using it broadly.
Acceptance notes:
- Add fixture repos that cover deterministic-only discovery, LLM-assisted
discovery, duplicate prevention, vanished entity retirement, declaration
override, and connector evidence.
- Test rescans across at least three commits/snapshots to prove duplicates are
not created and stale entities are retired.
- Test with `MockLLMAdapter` so CI does not require network or provider keys.
- Document scanner commands, scan profiles, identity model, review workflow,
LLM configuration, and failure modes.
- Run the first dry-run scan against a small set of local repos and record the
resulting implementation backlog.
## Open Questions
- Should discovered entities live only as registry-side candidates, or should
accepted discoveries eventually generate repo-owned declaration patches?
- Which source types should be accepted without human review, and which should
always remain candidates?
- How long should tombstones be retained, and should retention be per repo,
extractor, or entity kind?
- Should LLM extraction be synchronous in CLI scans, queued as background work,
or both?
- What budget and privacy controls are required before sending repo evidence to
an external model provider?
## Close Criteria
- A repo can be scanned, rescanned, and diffed without creating duplicate graph
entities.
- Removed repo evidence causes scoped candidate retirements instead of stale
active entities.
- LLM-assisted extraction is optional, schema-validated, provenance-rich, and
testable offline through `MockLLMAdapter`.
- Registry and graph explorer surfaces can show discovered facts with confidence
and review status.
- The first multi-repo dry run produces useful candidate graph improvements and
a clear next implementation backlog.