generated from coulomb/repo-seed
Add repo reality scanner workplan
334
workplans/RAIL-FAB-WP-0010-repo-reality-scanner.md
Normal file
@@ -0,0 +1,334 @@
---
id: RAIL-FAB-WP-0010
type: workplan
title: "Repo Reality Scanner"
domain: railiance
repo: railiance-fabric
status: proposed
owner: codex
topic_slug: railiance
planning_priority: high
planning_order: 10
created: "2026-05-19"
updated: "2026-05-19"
---

# RAIL-FAB-WP-0010 - Repo Reality Scanner

## Goal

Create a facility that scans or rescans repositories and turns observed repo
reality into Fabric graph entities: nodes, edges, and attributes. The scanner
should combine deterministic extraction rules with LLM-assisted interpretation
through `llm-connect`, so the Fabric graph can become more accurate over time
without requiring every repository to hand-author perfect declarations first.

The facility must be safe for repeated rescans. It should avoid duplicate
entities, update changed entities, and retire entities that are no longer
observable. The result should be a disciplined discovery pipeline that can
follow source files, manifests, service catalogs, package registries, API
contracts, deployment descriptors, and other systematic sources before asking
an LLM to infer higher-level structure.

## Background

Railiance Fabric currently treats repo-owned declarations as the authoritative
graph source and the registry as a read/index layer. That remains the desired
direction for high-confidence data. This workplan adds a discovery layer that
can propose, maintain, and explain graph facts gathered from available repos.

The scanner should not create untraceable graph magic. Every inferred node,
edge, and attribute needs source evidence, confidence, extraction method,
stable identity, and replacement semantics. Deterministic rules should win
where possible; LLM extraction should fill gaps, classify ambiguous evidence,
and explain candidate relationships using structured outputs.

`llm-connect` provides a local Python integration surface for this:

- `create_adapter(provider, model=...)`
- `RunConfig`
- `MockLLMAdapter` for offline tests
- provider resolution via TOML configuration
- typed LLM errors and usage metadata

## Design Principles

- Prefer systematic evidence before LLM inference.
- Preserve repo-owned declarations as higher-trust data than discovery output.
- Make stable entity identity explicit and deterministic.
- Treat rescans as replacement runs over a scoped source set, not as append-only
  guesses.
- Keep duplicates from entering the graph by matching stable keys, aliases,
  source anchors, normalized names, and declared relationships.
- Track stale or vanished evidence with tombstones or retired candidates rather
  than silently leaving old entities active.
- Make every candidate explainable: source path, extractor, prompt version,
  evidence snippets or structured references, confidence, and review state.
- Keep LLM prompts small, auditable, testable, and replaceable.
- Support dry-run diffs before accepting discovery changes into the registry.

## Proposed Architecture

The scanner should produce a `FabricDiscoverySnapshot` for a repository and
commit. A snapshot is a normalized candidate graph plus evidence metadata:

- candidate nodes
- candidate edges
- candidate attributes
- source anchors
- extraction provenance
- identity keys and aliases
- confidence/review state
- replacement scope
- tombstones for previously observed candidates no longer present

Accepted scanner output can then be projected into existing `FabricGraphExport`
snapshots, graph explorer payloads, and registry drift APIs. Where repo-owned
declarations already exist, discovered entities should be matched and attached
as supporting evidence rather than duplicated.
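
One possible shape for the snapshot, sketched with plain dataclasses. Only the
name `FabricDiscoverySnapshot` comes from this workplan; every field name, the
`Candidate` and `SourceAnchor` classes, and the sample values are hypothetical
placeholders until T01 defines the real contract.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class SourceAnchor:
    """Where a piece of evidence was found (hypothetical shape)."""
    path: str
    line: Optional[int] = None


@dataclass
class Candidate:
    """A candidate node, edge, or attribute plus its evidence metadata."""
    stable_key: str
    entity_type: str          # "node", "edge", or "attribute"
    data: dict
    extractor_id: str         # extraction provenance
    confidence: float
    review_state: str = "candidate"
    anchors: list = field(default_factory=list)
    aliases: set = field(default_factory=set)


@dataclass
class FabricDiscoverySnapshot:
    """Normalized candidate graph for one repo and commit."""
    repo_slug: str
    commit: str
    replacement_scope: str
    candidates: list = field(default_factory=list)
    tombstones: list = field(default_factory=list)  # stable keys no longer observed

    def of_type(self, entity_type: str) -> list:
        return [c for c in self.candidates if c.entity_type == entity_type]


# Placeholder values for illustration only.
snapshot = FabricDiscoverySnapshot(
    repo_slug="railiance-fabric",
    commit="abc123",
    replacement_scope="manifest-extractor",
)
snapshot.candidates.append(
    Candidate(
        stable_key="railiance-fabric:service:registry",
        entity_type="node",
        data={"kind": "service", "name": "registry"},
        extractor_id="manifest-extractor",
        confidence=1.0,
        anchors=[SourceAnchor(path="pyproject.toml", line=1)],
    )
)
```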

## Identity And Rescan Model

The key hard problem is identity. Each entity needs:

- a canonical stable key derived from repo slug, entity kind, normalized name,
  and source-specific anchors when available
- an alias set for names found in code, manifests, package metadata, route
  names, catalog ids, URLs, container names, deployment names, and service ids
- evidence fingerprints for source fragments that explain why the entity
  exists
- relationship keys based on normalized source id, target id, edge type, and
  evidence scope
- a replacement scope that says which extractor owns which candidates for a
  repo/commit
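
A minimal sketch of deterministic stable-key generation from the ingredients
listed above. The normalization rules, the key layout, and the function names
`normalize_name` and `stable_key` are assumptions for illustration, not a
decided format.

```python
import hashlib
import re
import unicodedata
from typing import Optional


def normalize_name(name: str) -> str:
    """Lowercase the name and collapse punctuation/whitespace runs to '-'."""
    text = unicodedata.normalize("NFKC", name).strip().lower()
    return re.sub(r"[^a-z0-9]+", "-", text).strip("-")


def stable_key(repo_slug: str, kind: str, name: str,
               anchor: Optional[str] = None) -> str:
    """Derive a readable key plus a short digest of all identity parts."""
    parts = [repo_slug.strip().lower(), kind.strip().lower(), normalize_name(name)]
    if anchor:
        # Source-specific anchor (for example a manifest path), when available.
        parts.append(anchor.strip())
    digest = hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()[:16]
    return f"{parts[0]}:{parts[1]}:{parts[2]}:{digest}"


key_a = stable_key("railiance-fabric", "service", "Billing API")
key_b = stable_key("railiance-fabric", "service", "billing_api")
key_c = stable_key("railiance-fabric", "library", "Billing API")
```

Here `key_a` and `key_b` are identical because both spellings normalize to
`billing-api`, while `key_c` differs because the entity kind differs. Note that
including an anchor in the key means a moved file yields a new key, which is one
reason the alias set and evidence fingerprints are tracked separately.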

On rescan:

- candidates with the same stable key are updated in place
- candidates matching only by alias/fingerprint are merged conservatively and
  flagged for review
- candidates missing from the same replacement scope become retired/tombstoned
- repo-owned declarations override discovery candidates where they conflict
- unreviewed LLM-only candidates remain candidate/review state, not accepted
  authoritative declarations
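
The stable-key and scoped-retirement rules above could be sketched as follows.
This covers only two of the rules (update in place, retire within the same
replacement scope); alias merging and declaration override are left out. The
`Stored` record and the `rescan` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Stored:
    stable_key: str
    scope: str          # replacement scope that owns this candidate
    fingerprint: str    # evidence fingerprint
    state: str = "active"  # "active" or "retired"


def rescan(existing: dict, observed: list, scope: str) -> dict:
    """Apply one scoped rescan to `existing` (keyed by stable key) and return a diff."""
    diff = {"added": [], "changed": [], "unchanged": [], "retired": []}
    seen = set()
    for cand in observed:
        if cand.scope != scope:
            raise ValueError(f"{cand.stable_key} is outside scope {scope!r}")
        seen.add(cand.stable_key)
        prev = existing.get(cand.stable_key)
        if prev is None:
            diff["added"].append(cand.stable_key)
        elif prev.fingerprint != cand.fingerprint or prev.state == "retired":
            diff["changed"].append(cand.stable_key)
        else:
            diff["unchanged"].append(cand.stable_key)
        existing[cand.stable_key] = cand  # same key: update in place
    # Retire only candidates owned by this scope that were not observed again.
    for key, prev in existing.items():
        if prev.scope == scope and prev.state == "active" and key not in seen:
            prev.state = "retired"
            diff["retired"].append(key)
    return diff


store = {
    "a": Stored("a", "manifests", "fp1"),
    "b": Stored("b", "manifests", "fp2"),
    "c": Stored("c", "catalog", "fp3"),
}
result = rescan(store, [Stored("a", "manifests", "fp1-new"),
                        Stored("d", "manifests", "fp4")], scope="manifests")
```

In this run `d` is added, `a` is changed, and `b` is retired. Candidate `c`
stays active because it belongs to a different replacement scope.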

## Tasks

### T01 - Discovery Contract And Identity Model

```task
id: RAIL-FAB-WP-0010-T01
status: todo
priority: high
```

Define the discovery snapshot contract, identity keys, alias handling,
replacement scopes, provenance fields, confidence/review states, and how
discovered facts map into existing Fabric graph exports.

Acceptance notes:

- Add schema/docs for `FabricDiscoverySnapshot`, candidate nodes, candidate
  edges, candidate attributes, source anchors, extractor provenance, and
  tombstones.
- Define stable-key generation rules for services, libraries, interfaces,
  dependencies, deployment/runtime entities, registries/catalog entries, and
  generic discovered entities.
- Define merge precedence between repo-owned declarations, deterministic
  discoveries, registry/catalog discoveries, and LLM discoveries.
- Define how duplicate candidates are detected and how uncertain matches are
  flagged for human review.
- Define rescan replacement semantics so stale entities are retired only within
  the source scope that produced them.

### T02 - Deterministic Repo Extractor Framework

```task
id: RAIL-FAB-WP-0010-T02
status: todo
priority: high
```

Implement the first deterministic extraction framework for repo-local evidence
before involving LLMs.

Acceptance notes:

- Add a scanner module and CLI surface that accepts repo path, repo slug,
  commit, scan profile, and dry-run/output options.
- Extract systematic evidence from known file families such as package
  manifests, lockfiles, Docker/Compose files, OpenAPI/AsyncAPI files, Score
  files, Helm/Kubernetes manifests, service config files, existing `fabric/`
  declarations, README/INTENT/SCOPE metadata, and repository metadata.
- Emit normalized candidate entities with source anchors and extractor ids.
- Keep extractor rules independently testable with fixtures.
- Avoid network access by default unless a connector profile explicitly enables
  registry/catalog lookups.
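
One way to keep extractor rules independently testable is a small registry of
pure functions keyed by file pattern. The sketch below is an assumption about
structure only: the `extractor` decorator, `scan_repo`, the candidate dict
layout, and the `npm-package-json` extractor id are all hypothetical, and it
reads local files only.

```python
import fnmatch
import json
from pathlib import Path

# (extractor_id, filename pattern, function) triples, filled by the decorator.
_REGISTRY = []


def extractor(extractor_id: str, pattern: str):
    """Register a function that turns one file's text into candidate dicts."""
    def register(fn):
        _REGISTRY.append((extractor_id, pattern, fn))
        return fn
    return register


@extractor("npm-package-json", "package.json")
def npm_manifest(text: str):
    data = json.loads(text)
    if "name" in data:
        yield {"kind": "library", "name": data["name"]}
    for dep in sorted(data.get("dependencies", {})):
        yield {"kind": "dependency", "name": dep}


def scan_repo(root: Path) -> list:
    """Run every matching extractor over the files under `root`."""
    found = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        relative = path.relative_to(root)
        if ".git" in relative.parts:
            continue  # skip VCS internals
        for extractor_id, pattern, fn in _REGISTRY:
            if fnmatch.fnmatch(path.name, pattern):
                for cand in fn(path.read_text(encoding="utf-8")):
                    # Attach provenance: which rule fired and on which file.
                    cand["extractor_id"] = extractor_id
                    cand["source_path"] = relative.as_posix()
                    found.append(cand)
    return found
```

Because each extractor receives only file text, a fixture test can call
`npm_manifest` directly with a string and never touch the filesystem.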

### T03 - LLM-Assisted Extraction Through llm-connect

```task
id: RAIL-FAB-WP-0010-T03
status: todo
priority: high
```

Add LLM-assisted extraction for ambiguous repo evidence using `llm-connect`
with structured, auditable outputs.

Acceptance notes:

- Integrate `llm-connect` through a small adapter boundary using
  `create_adapter`, `RunConfig`, provider/model configuration, and
  `MockLLMAdapter` tests.
- Use deterministic preselection to choose small evidence bundles for the LLM;
  do not send whole repos blindly.
- Require structured JSON output with candidate nodes, edges, attributes,
  evidence references, confidence, uncertainty, and rationale.
- Validate LLM output against the discovery schema before it can enter a
  snapshot.
- Record prompt version, model, provider, usage metadata, and extraction run id.
- Fail closed: malformed or low-confidence LLM output becomes a review artifact,
  not accepted graph data.
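
The fail-closed rule can be illustrated without calling `llm-connect` at all,
since it only concerns what happens to the returned text. In this sketch the
JSON layout (`nodes` with `name`, `kind`, `confidence`, `evidence`), the `0.6`
threshold, and the function names are assumptions; the real schema is whatever
T01 defines.

```python
import json

MIN_CONFIDENCE = 0.6  # assumed threshold, not a decided value
REQUIRED_TEXT_FIELDS = ("name", "kind")


def _problems(node) -> list:
    """Return schema problems for one candidate node (empty list if valid)."""
    if not isinstance(node, dict):
        return ["node is not an object"]
    problems = [f"missing or non-string '{f}'" for f in REQUIRED_TEXT_FIELDS
                if not isinstance(node.get(f), str)]
    conf = node.get("confidence")
    if isinstance(conf, bool) or not isinstance(conf, (int, float)):
        problems.append("missing or non-numeric 'confidence'")
    evidence = node.get("evidence")
    if not isinstance(evidence, list) or not evidence:
        problems.append("missing evidence references")
    return problems


def triage_llm_output(raw: str, min_confidence: float = MIN_CONFIDENCE) -> dict:
    """Split raw model text into snapshot candidates and review artifacts."""
    result = {"candidates": [], "review": []}
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        result["review"].append({"reason": f"invalid JSON: {exc.msg}", "raw": raw})
        return result
    nodes = payload.get("nodes") if isinstance(payload, dict) else None
    if not isinstance(nodes, list):
        result["review"].append({"reason": "missing 'nodes' list", "raw": raw})
        return result
    for node in nodes:
        problems = _problems(node)
        if not problems and node["confidence"] < min_confidence:
            problems = ["confidence below threshold"]
        if problems:
            result["review"].append({"reason": "; ".join(problems), "node": node})
        else:
            result["candidates"].append(node)
    return result
```

Nothing reaches `candidates` unless it parsed, matched the expected shape, and
cleared the threshold; everything else is kept as a review artifact with a
reason rather than dropped.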

### T04 - Reconciliation, Deduplication, And Tombstones

```task
id: RAIL-FAB-WP-0010-T04
status: todo
priority: high
```

Build the reconciliation engine that merges deterministic, catalog, declaration,
and LLM candidates into a coherent repo discovery snapshot.

Acceptance notes:

- Match candidates by stable key, source anchor, alias, normalized labels, and
  relationship fingerprints.
- Merge attributes with source-aware precedence and conflict reporting.
- Prevent duplicate nodes/edges from entering accepted scanner output.
- Produce explicit added/changed/retired/conflicted candidate sets for dry-run
  review.
- Retire vanished candidates only inside their extractor replacement scope.
- Preserve historical tombstones long enough to explain graph drift and avoid
  immediate re-creation loops.
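
Source-aware attribute merging with conflict reporting might look like the
sketch below. The precedence order is an assumption: this workplan states that
repo-owned declarations win and that deterministic rules should beat LLM
output, but the position of catalog data and the source labels themselves are
placeholders for what T01 decides.

```python
# Lower number wins. Assumed ordering; only "declaration first" and
# "deterministic before llm" are grounded in this workplan.
PRECEDENCE = {"declaration": 0, "deterministic": 1, "catalog": 2, "llm": 3}


def merge_attributes(observations: list):
    """Merge (source, attributes) pairs; return merged values and conflicts."""
    merged = {}
    winners = {}
    conflicts = []
    # sorted() is stable, so equal-precedence sources keep their input order.
    for source, attrs in sorted(observations, key=lambda o: PRECEDENCE[o[0]]):
        for name, value in attrs.items():
            if name not in merged:
                merged[name] = value
                winners[name] = source
            elif merged[name] != value:
                # Keep the higher-trust value and report the disagreement.
                conflicts.append({
                    "attribute": name,
                    "kept": (winners[name], merged[name]),
                    "rejected": (source, value),
                })
    return merged, conflicts


merged, conflicts = merge_attributes([
    ("llm", {"language": "go", "owner": "team-a"}),
    ("deterministic", {"language": "python"}),
    ("declaration", {"owner": "team-b"}),
])
```

With these inputs the merged result keeps `owner` from the declaration and
`language` from the deterministic extractor, and both LLM values are reported
as conflicts for dry-run review instead of being silently discarded.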

### T05 - Registry And Catalog Follow-Up Connectors

```task
id: RAIL-FAB-WP-0010-T05
status: todo
priority: medium
```

Add connector slots for systematic follow-up against registries and catalogs
where a repo points to more authoritative metadata.

Acceptance notes:

- Define connector interfaces for package registries, container registries,
  API catalogs, service catalogs, deployment inventories, and existing Fabric
  registry data.
- Add at least one offline-safe prototype connector using local registry data or
  fixtures.
- Keep connector evidence separate from repo-local evidence in replacement
  scopes.
- Represent connector failures, rate limits, and unavailable catalogs without
  corrupting repo-local scan results.
- Document when connector-derived facts should be accepted, candidate, or
  review-only.

### T06 - Registry Integration And Dry-Run Review

```task
id: RAIL-FAB-WP-0010-T06
status: todo
priority: high
```

Integrate discovery snapshots with the Fabric registry so scans can be reviewed,
accepted, and reflected in graph queries without losing provenance.

Acceptance notes:

- Add storage/API support for latest discovery snapshots per repo/commit/profile.
- Add dry-run diff output that shows candidate additions, changes, retirements,
  conflicts, duplicate merges, and confidence changes.
- Add an accept path that projects accepted discovery output into the combined
  graph without overwriting repo-owned declarations.
- Expose discovery provenance and review state through inventory, graph, drift,
  and graph explorer payloads.
- Preserve existing registry snapshot replacement semantics for accepted graph
  exports.

### T07 - Multi-Repo Scan Orchestration

```task
id: RAIL-FAB-WP-0010-T07
status: todo
priority: medium
```

Add orchestration so the scanner can run across the known local repo manifests
and improve Fabric coverage over time.

Acceptance notes:

- Extend or complement `registry sync-manifest` with scan/rescan commands.
- Support repo allowlists, scan profiles, max LLM budget, deterministic-only
  mode, and dry-run mode.
- Produce a concise summary across repos: scanned, changed, retired, conflicted,
  LLM skipped, LLM failed, and accepted.
- Keep one repo failure from aborting the entire multi-repo run.
- Record enough metadata for State Hub progress notes and future automation.
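
Failure isolation and a shared LLM budget can be sketched with a plain loop.
Everything here is hypothetical: `scan_all`, the `scan_one(slug, use_llm)`
callback, the outcome dict keys, and the choice to measure the budget in LLM
calls and to count `changed`/`retired` as candidate totals.

```python
from collections import Counter


def scan_all(repos, scan_one, llm_budget: int = 0, deterministic_only: bool = False):
    """Scan each repo; one failure is recorded and does not stop the run."""
    summary = Counter()
    failures = {}
    remaining = llm_budget
    for slug in repos:
        use_llm = not deterministic_only and remaining > 0
        try:
            outcome = scan_one(slug, use_llm)
        except Exception as exc:  # isolate per-repo failures
            summary["failed"] += 1
            failures[slug] = repr(exc)
            continue
        summary["scanned"] += 1
        summary["changed"] += outcome.get("changed", 0)
        summary["retired"] += outcome.get("retired", 0)
        if use_llm:
            remaining -= outcome.get("llm_calls", 0)
        else:
            summary["llm_skipped"] += 1
    return dict(summary), failures


def fake_scan(slug, use_llm):
    """Stand-in scanner used only to exercise the loop."""
    if slug == "bad":
        raise RuntimeError("unreadable manifest")
    return {"changed": 2 if use_llm else 1, "retired": 0,
            "llm_calls": 1 if use_llm else 0}


summary, failures = scan_all(["a", "bad", "b"], fake_scan, llm_budget=1)
```

In this run repo `a` spends the whole budget, `bad` is recorded under
`failures`, and `b` is still scanned in deterministic-only fashion.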

### T08 - Tests, Fixtures, Documentation, And Rollout

```task
id: RAIL-FAB-WP-0010-T08
status: todo
priority: medium
```

Harden the scanner with fixture coverage and clear adoption guidance before
using it broadly.

Acceptance notes:

- Add fixture repos that cover deterministic-only discovery, LLM-assisted
  discovery, duplicate prevention, vanished entity retirement, declaration
  override, and connector evidence.
- Test rescans across at least three commits/snapshots to prove duplicates are
  not created and stale entities are retired.
- Test with `MockLLMAdapter` so CI does not require network or provider keys.
- Document scanner commands, scan profiles, identity model, review workflow,
  LLM configuration, and failure modes.
- Run the first dry-run scan against a small set of local repos and record the
  resulting implementation backlog.

## Open Questions

- Should discovered entities live only as registry-side candidates, or should
  accepted discoveries eventually generate repo-owned declaration patches?
- Which source types should be accepted without human review, and which should
  always remain candidates?
- How long should tombstones be retained, and should retention be per repo,
  extractor, or entity kind?
- Should LLM extraction be synchronous in CLI scans, queued as background work,
  or both?
- What budget and privacy controls are required before sending repo evidence to
  an external model provider?

## Close Criteria

- A repo can be scanned, rescanned, and diffed without creating duplicate graph
  entities.
- Removed repo evidence causes scoped candidate retirements instead of stale
  active entities.
- LLM-assisted extraction is optional, schema-validated, provenance-rich, and
  testable offline through `MockLLMAdapter`.
- Registry and graph explorer surfaces can show discovered facts with confidence
  and review status.
- The first multi-repo dry run produces useful candidate graph improvements and
  a clear next implementation backlog.