| id | type | title | domain | repo | status | owner | topic_slug | planning_priority | planning_order | created | updated | state_hub_workstream_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RAIL-FAB-WP-0010 | workplan | Repo Reality Scanner | railiance | railiance-fabric | active | codex | railiance | high | 10 | 2026-05-19 | 2026-05-19 | ac88ac5e-7183-47ba-ac05-1a1b44cfd47e |
RAIL-FAB-WP-0010 - Repo Reality Scanner
Goal
Create a facility that scans or rescans repositories and turns observed repo
reality into Fabric graph entities: nodes, edges, and attributes. The scanner
should combine deterministic extraction rules with LLM-assisted interpretation
through llm-connect, so the Fabric graph can become more accurate over time
without requiring every repository to hand-author perfect declarations first.
The facility must be safe for repeated rescans. It should avoid duplicate entities, update changed entities, and retire entities that are no longer observable. The result should be a disciplined discovery pipeline that can follow source files, manifests, service catalogs, package registries, API contracts, deployment descriptors, and other systematic sources before asking an LLM to infer higher-level structure.
Background
Railiance Fabric currently treats repo-owned declarations as the authoritative graph source and the registry as a read/index layer. That remains the desired direction for high-confidence data. This workplan adds a discovery layer that can propose, maintain, and explain graph facts gathered from available repos.
The scanner should not create untraceable graph magic. Every inferred node, edge, and attribute needs source evidence, confidence, extraction method, stable identity, and replacement semantics. Deterministic rules should win where possible; LLM extraction should fill gaps, classify ambiguous evidence, and explain candidate relationships using structured outputs.
llm-connect provides a local Python integration surface for this:
- create_adapter(provider, model=...)
- RunConfig
- MockLLMAdapter for offline tests
- provider resolution via TOML configuration
- typed LLM errors and usage metadata
Design Principles
- Prefer systematic evidence before LLM inference.
- Preserve repo-owned declarations as higher-trust data than discovery output.
- Make stable entity identity explicit and deterministic.
- Treat rescans as replacement runs over a scoped source set, not as append-only guesses.
- Keep duplicates from entering the graph by matching stable keys, aliases, source anchors, normalized names, and declared relationships.
- Track stale or vanished evidence with tombstones or retired candidates rather than silently leaving old entities active.
- Make every candidate explainable: source path, extractor, prompt version, evidence snippets or structured references, confidence, and review state.
- Keep LLM prompts small, auditable, testable, and replaceable.
- Support dry-run diffs before accepting discovery changes into the registry.
Proposed Architecture
The scanner should produce a FabricDiscoverySnapshot for a repository and
commit. A snapshot is a normalized candidate graph plus evidence metadata:
- candidate nodes
- candidate edges
- candidate attributes
- source anchors
- extraction provenance
- identity keys and aliases
- confidence/review state
- replacement scope
- tombstones for previously observed candidates no longer present
Accepted scanner output can then be projected into existing FabricGraphExport
snapshots, graph explorer payloads, and registry drift APIs. Where repo-owned
declarations already exist, discovered entities should be matched and attached
as supporting evidence rather than duplicated.
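As an illustration of the snapshot shape listed above, here is a minimal sketch using dataclasses. The field names, the `Candidate` and `SourceAnchor` types, and the `to_json` helper are assumptions made for this example; they are not the schema delivered under T01.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass(frozen=True)
class SourceAnchor:
    # Hypothetical anchor: a repo-relative path plus an optional line range.
    path: str
    line_start: Optional[int] = None
    line_end: Optional[int] = None


@dataclass
class Candidate:
    stable_key: str
    kind: str  # "node", "edge", or "attribute"
    payload: dict
    aliases: list = field(default_factory=list)
    anchors: list = field(default_factory=list)
    extractor_id: str = ""
    confidence: float = 1.0
    review_state: str = "candidate"


@dataclass
class FabricDiscoverySnapshot:
    repo_slug: str
    commit: str
    replacement_scope: str
    candidates: list = field(default_factory=list)
    tombstones: list = field(default_factory=list)  # stable keys no longer observed

    def to_json(self) -> str:
        # sort_keys keeps the serialized form stable, which helps dry-run diffs.
        return json.dumps(asdict(self), sort_keys=True)
```

Keeping provenance fields on every candidate, rather than on the snapshot as a whole, lets a reviewer trace any single node or edge back to its extractor and source file.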
Identity And Rescan Model
The hardest problem is identity. Each entity needs:
- a canonical stable key derived from repo slug, entity kind, normalized name, and source-specific anchors when available
- an alias set for names found in code, manifests, package metadata, route names, catalog ids, URLs, container names, deployment names, and service ids
- evidence fingerprints for source fragments that explain why the entity exists
- relationship keys based on normalized source id, target id, edge type, and evidence scope
- a replacement scope that says which extractor owns which candidates for a repo/commit
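One way to derive a canonical stable key from the inputs named above is sketched below. The `normalize_name` and `stable_key` helpers, the key layout, and the 12-character digest are all assumptions for illustration, not the rules defined in T01.

```python
import hashlib
import re


def normalize_name(name: str) -> str:
    """Lowercase and collapse runs of non-alphanumerics into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def stable_key(repo_slug: str, kind: str, name: str, anchors=()) -> str:
    """Build a deterministic key; anchors are optional source-specific ids."""
    base = f"{normalize_name(repo_slug)}:{kind}:{normalize_name(name)}"
    if not anchors:
        return base
    # Sorting makes the key independent of the order anchors were discovered in.
    joined = "\n".join(sorted(anchors)).encode("utf-8")
    return f"{base}#{hashlib.sha256(joined).hexdigest()[:12]}"
```

In this sketch an anchor should be something that survives refactoring, such as a catalog id, because hashing a file path into the key would change the identity whenever the file moves.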
On rescan:
- candidates with the same stable key are updated in place
- candidates matching only by alias/fingerprint are merged conservatively and flagged for review
- candidates missing from the same replacement scope become retired/tombstoned
- repo-owned declarations override discovery candidates where they conflict
- unreviewed LLM-only candidates remain in candidate/review state and are not promoted to accepted authoritative declarations
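The update-in-place and scoped-retirement rules above can be sketched as a comparison of two keyed candidate sets. The `reconcile_rescan` function and the `scope` and `fingerprint` fields are hypothetical names for this example; alias-based merging and declaration overrides are deliberately left out.

```python
def reconcile_rescan(previous: dict, current: dict, scope: str) -> dict:
    """Compare two scans keyed by stable key.

    Each value is a dict with at least "scope" and "fingerprint".
    Only candidates owned by `scope` may be retired by this run.
    """
    added, changed, unchanged, retired = [], [], [], []
    for key, cand in current.items():
        old = previous.get(key)
        if old is None:
            added.append(key)
        elif old["fingerprint"] != cand["fingerprint"]:
            changed.append(key)  # same identity, new evidence: update in place
        else:
            unchanged.append(key)
    for key, old in previous.items():
        # A candidate from another extractor's scope is never retired here.
        if old["scope"] == scope and key not in current:
            retired.append(key)
    return {
        "added": sorted(added),
        "changed": sorted(changed),
        "unchanged": sorted(unchanged),
        "retired": sorted(retired),
    }
```

Restricting retirement to the matching scope is what keeps a manifest rescan from tombstoning entities that only a different extractor or connector could have observed.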
Tasks
T01 - Discovery Contract And Identity Model
id: RAIL-FAB-WP-0010-T01
status: done
priority: high
state_hub_task_id: "d77423fa-a47f-4246-86bd-ea1ca2d17bc4"
Define the discovery snapshot contract, identity keys, alias handling, replacement scopes, provenance fields, confidence/review states, and how discovered facts map into existing Fabric graph exports.
Acceptance notes:
- Add schema/docs for FabricDiscoverySnapshot, candidate nodes, candidate edges, candidate attributes, source anchors, extractor provenance, and tombstones.
- Define stable-key generation rules for services, libraries, interfaces, dependencies, deployment/runtime entities, registries/catalog entries, and generic discovered entities.
- Define merge precedence between repo-owned declarations, deterministic discoveries, registry/catalog discoveries, and LLM discoveries.
- Define how duplicate candidates are detected and how uncertain matches are flagged for human review.
- Define rescan replacement semantics so stale entities are retired only within the source scope that produced them.
T02 - Deterministic Repo Extractor Framework
id: RAIL-FAB-WP-0010-T02
status: done
priority: high
state_hub_task_id: "5d2ff304-9c79-4699-bf8c-ed6db3a90d9f"
Implement the first deterministic extraction framework for repo-local evidence before involving LLMs.
Acceptance notes:
- Add a scanner module and CLI surface that accepts repo path, repo slug, commit, scan profile, and dry-run/output options.
- Extract systematic evidence from known file families such as package manifests, lockfiles, Docker/Compose files, OpenAPI/AsyncAPI files, Score files, Helm/Kubernetes manifests, service config files, existing fabric/declarations, README/INTENT/SCOPE metadata, and repository metadata.
- Emit normalized candidate entities with source anchors and extractor ids.
- Keep extractor rules independently testable with fixtures.
- Avoid network access by default unless a connector profile explicitly enables registry/catalog lookups.
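A rule-per-file-family framework of this kind could look roughly like the sketch below. The `extractor` decorator, the `scan_repo` function, the candidate dict layout, and the `npm-manifest-v1` id are assumptions for illustration and do not describe the module delivered for this task. The sketch reads local files only and performs no network access.

```python
import fnmatch
import json
from pathlib import Path

EXTRACTORS = []  # (filename glob, extractor id, function) triples


def extractor(pattern: str, extractor_id: str):
    """Register a rule so each one stays independently testable."""
    def register(fn):
        EXTRACTORS.append((pattern, extractor_id, fn))
        return fn
    return register


@extractor("package.json", "npm-manifest-v1")
def extract_package_json(text: str):
    data = json.loads(text)
    name = data.get("name", "")
    if not name:
        return
    yield {"kind": "node", "type": "library", "name": name}
    for dep in sorted(data.get("dependencies", {})):
        yield {"kind": "edge", "type": "depends_on", "source": name, "target": dep}


def scan_repo(repo_path: str) -> list:
    """Walk the repo and run every matching rule on local files."""
    root = Path(repo_path)
    candidates = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        for pattern, extractor_id, fn in EXTRACTORS:
            if fnmatch.fnmatch(path.name, pattern):
                for cand in fn(path.read_text(encoding="utf-8")):
                    cand["extractor_id"] = extractor_id
                    cand["anchor"] = path.relative_to(root).as_posix()
                    candidates.append(cand)
    return candidates
```

A real scan profile would also need to exclude vendored directories such as node_modules and .git, which this sketch does not do.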
T03 - LLM-Assisted Extraction Through llm-connect
id: RAIL-FAB-WP-0010-T03
status: done
priority: high
state_hub_task_id: "59c206a3-94b9-4f47-9c4f-75f87aa8f505"
Add LLM-assisted extraction for ambiguous repo evidence using llm-connect
with structured, auditable outputs.
Acceptance notes:
- Integrate llm-connect through a small adapter boundary using create_adapter, RunConfig, provider/model configuration, and MockLLMAdapter tests.
- Use deterministic preselection to choose small evidence bundles for the LLM; do not send whole repos blindly.
- Require structured JSON output with candidate nodes, edges, attributes, evidence references, confidence, uncertainty, and rationale.
- Validate LLM output against the discovery schema before it can enter a snapshot.
- Record prompt version, model, provider, usage metadata, and extraction run id.
- Fail closed: malformed or low-confidence LLM output becomes a review artifact, not accepted graph data.
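The fail-closed rule can be illustrated with a small triage step applied to whatever text the adapter returns. The sketch below does not call llm-connect at all. The `triage_llm_output` function, the required section names, the per-item `evidence` and `confidence` fields, and the 0.7 threshold are assumptions for this example rather than the shipped schema.

```python
import json

REQUIRED_SECTIONS = ("nodes", "edges", "attributes")
MIN_CONFIDENCE = 0.7  # hypothetical threshold


def triage_llm_output(raw_text: str, min_confidence: float = MIN_CONFIDENCE):
    """Return ("candidate", payload) or ("review_artifact", reason)."""
    try:
        payload = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return "review_artifact", f"malformed JSON: {exc.msg}"
    if not isinstance(payload, dict) or not set(REQUIRED_SECTIONS).issubset(payload):
        return "review_artifact", "missing required sections"
    items = []
    for section in REQUIRED_SECTIONS:
        if not isinstance(payload[section], list):
            return "review_artifact", f"{section} is not a list"
        items.extend(payload[section])
    for item in items:
        if not isinstance(item, dict) or not item.get("evidence"):
            return "review_artifact", "candidate without evidence reference"
        conf = item.get("confidence")
        if isinstance(conf, bool) or not isinstance(conf, (int, float)):
            return "review_artifact", "missing confidence"
        if conf < min_confidence:
            return "review_artifact", "low confidence"
    # Valid output is still only a candidate; acceptance needs review.
    return "candidate", payload
```

Any single failing item sends the whole response to review in this sketch, which is the strictest reading of fail closed; a per-item split would be a looser alternative.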
T04 - Reconciliation, Deduplication, And Tombstones
id: RAIL-FAB-WP-0010-T04
status: todo
priority: high
state_hub_task_id: "f0844595-23e0-4e7a-bfd9-e0526b8f85b9"
Build the reconciliation engine that merges deterministic, catalog, declaration, and LLM candidates into a coherent repo discovery snapshot.
Acceptance notes:
- Match candidates by stable key, source anchor, alias, normalized labels, and relationship fingerprints.
- Merge attributes with source-aware precedence and conflict reporting.
- Prevent duplicate nodes/edges from entering accepted scanner output.
- Produce explicit added/changed/retired/conflicted candidate sets for dry-run review.
- Retire vanished candidates only inside their extractor replacement scope.
- Preserve historical tombstones long enough to explain graph drift and avoid immediate re-creation loops.
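Source-aware attribute merging with conflict reporting might look like the following sketch. The `PRECEDENCE` ordering is an assumption: this plan ranks repo-owned declarations highest and deterministic rules above LLM output, but the position of catalog evidence is chosen here only for illustration. The `merge_attributes` name and tuple layout are likewise hypothetical.

```python
# Lower number wins. The catalog position is an assumption for this sketch.
PRECEDENCE = {"declaration": 0, "deterministic": 1, "catalog": 2, "llm": 3}


def merge_attributes(observations):
    """Merge (source, attribute_name, value) tuples for one entity.

    Returns (merged, conflicts): the winning value per attribute, plus a
    report of every lower-precedence value that disagreed with the winner.
    """
    by_name = {}
    for source, name, value in observations:
        by_name.setdefault(name, []).append((PRECEDENCE[source], source, value))
    merged, conflicts = {}, []
    for name, entries in sorted(by_name.items()):
        entries.sort(key=lambda entry: entry[0])
        _, win_source, win_value = entries[0]
        merged[name] = {"value": win_value, "source": win_source}
        overridden = [
            {"source": source, "value": value}
            for _, source, value in entries[1:]
            if value != win_value
        ]
        if overridden:
            conflicts.append(
                {"attribute": name, "winner": win_source, "overridden": overridden}
            )
    return merged, conflicts
```

Agreeing values from lower-precedence sources are not reported as conflicts, so the conflict list stays limited to cases a reviewer actually needs to look at.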
T05 - Registry And Catalog Follow-Up Connectors
id: RAIL-FAB-WP-0010-T05
status: todo
priority: medium
state_hub_task_id: "d664301d-c531-4cf8-a1dd-cbadda0e0fdb"
Add connector slots for systematic follow-up against registries and catalogs where a repo points to more authoritative metadata.
Acceptance notes:
- Define connector interfaces for package registries, container registries, API catalogs, service catalogs, deployment inventories, and existing Fabric registry data.
- Add at least one offline-safe prototype connector using local registry data or fixtures.
- Keep connector evidence separate from repo-local evidence in replacement scopes.
- Represent connector failures, rate limits, and unavailable catalogs without corrupting repo-local scan results.
- Document when connector-derived facts should be accepted, candidate, or review-only.
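A connector slot that reports failures without disturbing repo-local results could be shaped like this sketch. The `Connector` protocol, `ConnectorResult`, `FixtureRegistryConnector`, and `run_connector` are hypothetical names; the fixture-backed connector stands in for the offline-safe prototype and uses an in-memory dict instead of real registry data.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class ConnectorResult:
    connector_id: str
    status: str  # "ok" or "unavailable" in this sketch
    candidates: list = field(default_factory=list)
    detail: str = ""


class Connector(Protocol):
    connector_id: str

    def lookup(self, ref: str) -> list: ...


class FixtureRegistryConnector:
    """Offline connector backed by an in-memory fixture mapping."""

    connector_id = "fixture-registry"

    def __init__(self, fixtures: dict):
        self._fixtures = fixtures

    def lookup(self, ref: str) -> list:
        if ref not in self._fixtures:
            raise LookupError(f"no fixture entry for {ref!r}")
        return [{"kind": "attribute", "entity": ref, "attributes": self._fixtures[ref]}]


def run_connector(connector: Connector, ref: str) -> ConnectorResult:
    """Convert any connector failure into a result record, never an exception."""
    try:
        return ConnectorResult(connector.connector_id, "ok", connector.lookup(ref))
    except Exception as exc:
        return ConnectorResult(connector.connector_id, "unavailable", detail=str(exc))
```

Because the result carries its own `connector_id`, connector evidence can be filed under a replacement scope separate from the repo-local extractors.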
T06 - Registry Integration And Dry-Run Review
id: RAIL-FAB-WP-0010-T06
status: todo
priority: high
state_hub_task_id: "9a8420f1-0072-4f40-8d0f-775f59cbe772"
Integrate discovery snapshots with the Fabric registry so scans can be reviewed, accepted, and reflected in graph queries without losing provenance.
Acceptance notes:
- Add storage/API support for latest discovery snapshots per repo/commit/profile.
- Add dry-run diff output that shows candidate additions, changes, retirements, conflicts, duplicate merges, and confidence changes.
- Add an accept path that projects accepted discovery output into the combined graph without overwriting repo-owned declarations.
- Expose discovery provenance and review state through inventory, graph, drift, and graph explorer payloads.
- Preserve existing registry snapshot replacement semantics for accepted graph exports.
T07 - Multi-Repo Scan Orchestration
id: RAIL-FAB-WP-0010-T07
status: todo
priority: medium
state_hub_task_id: "28014246-0a64-4d69-8065-98de881bffb4"
Add orchestration so the scanner can run across the known local repo manifests and improve Fabric coverage over time.
Acceptance notes:
- Extend or complement registry sync-manifest with scan/rescan commands.
- Support repo allowlists, scan profiles, max LLM budget, deterministic-only mode, and dry-run mode.
- Produce a concise summary across repos: scanned, changed, retired, conflicted, LLM skipped, LLM failed, and accepted.
- Keep one repo failure from aborting the entire multi-repo run.
- Record enough metadata for State Hub progress notes and future automation.
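The failure-isolation and summary requirements above can be sketched as a loop that catches per-repo errors and aggregates counts. The `scan_many` function, its `scan_one` callback, and the summary keys are assumptions for this example; a real run would also carry scan profile, LLM budget, and dry-run options.

```python
from collections import Counter


def scan_many(repos, scan_one, allowlist=None):
    """Run scan_one(slug) for each repo and aggregate the outcome counts.

    scan_one returns a dict of counts such as {"changed": 2, "retired": 1}.
    A failure in one repo is recorded and the run continues.
    """
    summary = Counter()
    failures = {}
    for slug in repos:
        if allowlist is not None and slug not in allowlist:
            summary["skipped"] += 1
            continue
        try:
            outcome = scan_one(slug)
        except Exception as exc:
            failures[slug] = str(exc)
            summary["failed"] += 1
            continue
        summary["scanned"] += 1
        summary.update(outcome)  # Counter.update adds counts per key
    return dict(summary), failures
```

Returning the failures separately from the counts keeps the summary concise while still leaving enough detail for progress notes.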
T08 - Tests, Fixtures, Documentation, And Rollout
id: RAIL-FAB-WP-0010-T08
status: todo
priority: medium
state_hub_task_id: "7a5b7dd7-92c6-4ac5-ae4d-6e73f75aac0d"
Harden the scanner with fixture coverage and clear adoption guidance before using it broadly.
Acceptance notes:
- Add fixture repos that cover deterministic-only discovery, LLM-assisted discovery, duplicate prevention, vanished entity retirement, declaration override, and connector evidence.
- Test rescans across at least three commits/snapshots to prove duplicates are not created and stale entities are retired.
- Test with MockLLMAdapter so CI does not require network or provider keys.
- Document scanner commands, scan profiles, identity model, review workflow, LLM configuration, and failure modes.
- Run the first dry-run scan against a small set of local repos and record the resulting implementation backlog.
Open Questions
- Should discovered entities live only as registry-side candidates, or should accepted discoveries eventually generate repo-owned declaration patches?
- Which source types should be accepted without human review, and which should always remain candidates?
- How long should tombstones be retained, and should retention be per repo, extractor, or entity kind?
- Should LLM extraction be synchronous in CLI scans, queued as background work, or both?
- What budget and privacy controls are required before sending repo evidence to an external model provider?
Close Criteria
- A repo can be scanned, rescanned, and diffed without creating duplicate graph entities.
- Removed repo evidence causes scoped candidate retirements instead of stale active entities.
- LLM-assisted extraction is optional, schema-validated, provenance-rich, and testable offline through MockLLMAdapter.
- Registry and graph explorer surfaces can show discovered facts with confidence and review status.
- The first multi-repo dry run produces useful candidate graph improvements and a clear next implementation backlog.