diff --git a/workplans/RAIL-FAB-WP-0010-repo-reality-scanner.md b/workplans/RAIL-FAB-WP-0010-repo-reality-scanner.md
new file mode 100644
index 0000000..80f6047
--- /dev/null
+++ b/workplans/RAIL-FAB-WP-0010-repo-reality-scanner.md
@@ -0,0 +1,334 @@
+---
+id: RAIL-FAB-WP-0010
+type: workplan
+title: "Repo Reality Scanner"
+domain: railiance
+repo: railiance-fabric
+status: proposed
+owner: codex
+topic_slug: railiance
+planning_priority: high
+planning_order: 10
+created: "2026-05-19"
+updated: "2026-05-19"
+---
+
+# RAIL-FAB-WP-0010 - Repo Reality Scanner
+
+## Goal
+
+Create a facility that scans or rescans repositories and turns observed repo
+reality into Fabric graph entities: nodes, edges, and attributes. The scanner
+should combine deterministic extraction rules with LLM-assisted interpretation
+through `llm-connect`, so the Fabric graph can become more accurate over time
+without requiring every repository to hand-author perfect declarations first.
+
+The facility must be safe for repeated rescans. It should avoid duplicate
+entities, update changed entities, and retire entities that are no longer
+observable. The result should be a disciplined discovery pipeline that can
+follow source files, manifests, service catalogs, package registries, API
+contracts, deployment descriptors, and other systematic sources before asking
+an LLM to infer higher-level structure.
+
+## Background
+
+Railiance Fabric currently treats repo-owned declarations as the authoritative
+graph source and the registry as a read/index layer. That remains the desired
+direction for high-confidence data. This workplan adds a discovery layer that
+can propose, maintain, and explain graph facts gathered from available repos.
+
+The scanner should not create untraceable graph magic. Every inferred node,
+edge, and attribute needs source evidence, confidence, extraction method,
+stable identity, and replacement semantics.
+Deterministic rules should win
+where possible; LLM extraction should fill gaps, classify ambiguous evidence,
+and explain candidate relationships using structured outputs.
+
+`llm-connect` provides a local Python integration surface for this:
+
+- `create_adapter(provider, model=...)`
+- `RunConfig`
+- `MockLLMAdapter` for offline tests
+- provider resolution via TOML configuration
+- typed LLM errors and usage metadata
+
+## Design Principles
+
+- Prefer systematic evidence before LLM inference.
+- Preserve repo-owned declarations as higher-trust data than discovery output.
+- Make stable entity identity explicit and deterministic.
+- Treat rescans as replacement runs over a scoped source set, not as append-only
+  guesses.
+- Keep duplicates from entering the graph by matching stable keys, aliases,
+  source anchors, normalized names, and declared relationships.
+- Track stale or vanished evidence with tombstones or retired candidates rather
+  than silently leaving old entities active.
+- Make every candidate explainable: source path, extractor, prompt version,
+  evidence snippets or structured references, confidence, and review state.
+- Keep LLM prompts small, auditable, testable, and replaceable.
+- Support dry-run diffs before accepting discovery changes into the registry.
+
+## Proposed Architecture
+
+The scanner should produce a `FabricDiscoverySnapshot` for a repository and
+commit. A snapshot is a normalized candidate graph plus evidence metadata:
+
+- candidate nodes
+- candidate edges
+- candidate attributes
+- source anchors
+- extraction provenance
+- identity keys and aliases
+- confidence/review state
+- replacement scope
+- tombstones for previously observed candidates no longer present
+
+Accepted scanner output can then be projected into existing `FabricGraphExport`
+snapshots, graph explorer payloads, and registry drift APIs.
+Where repo-owned
+declarations already exist, discovered entities should be matched and attached
+as supporting evidence rather than duplicated.
+
+## Identity And Rescan Model
+
+The key hard problem is identity. Each entity needs:
+
+- a canonical stable key derived from repo slug, entity kind, normalized name,
+  and source-specific anchors when available
+- an alias set for names found in code, manifests, package metadata, route
+  names, catalog ids, URLs, container names, deployment names, and service ids
+- evidence fingerprints for source fragments that explain why the entity
+  exists
+- relationship keys based on normalized source id, target id, edge type, and
+  evidence scope
+- a replacement scope that says which extractor owns which candidates for a
+  repo/commit
+
+On rescan:
+
+- candidates with the same stable key are updated in place
+- candidates matching only by alias/fingerprint are merged conservatively and
+  flagged for review
+- candidates missing from the same replacement scope become retired/tombstoned
+- repo-owned declarations override discovery candidates where they conflict
+- unreviewed LLM-only candidates remain candidate/review state, not accepted
+  authoritative declarations
+
+## Tasks
+
+### T01 - Discovery Contract And Identity Model
+
+```task
+id: RAIL-FAB-WP-0010-T01
+status: todo
+priority: high
+```
+
+Define the discovery snapshot contract, identity keys, alias handling,
+replacement scopes, provenance fields, confidence/review states, and how
+discovered facts map into existing Fabric graph exports.
+
+Acceptance notes:
+
+- Add schema/docs for `FabricDiscoverySnapshot`, candidate nodes, candidate
+  edges, candidate attributes, source anchors, extractor provenance, and
+  tombstones.
+- Define stable-key generation rules for services, libraries, interfaces,
+  dependencies, deployment/runtime entities, registries/catalog entries, and
+  generic discovered entities.
+- Define merge precedence between repo-owned declarations, deterministic
+  discoveries, registry/catalog discoveries, and LLM discoveries.
+- Define how duplicate candidates are detected and how uncertain matches are
+  flagged for human review.
+- Define rescan replacement semantics so stale entities are retired only within
+  the source scope that produced them.
+
+### T02 - Deterministic Repo Extractor Framework
+
+```task
+id: RAIL-FAB-WP-0010-T02
+status: todo
+priority: high
+```
+
+Implement the first deterministic extraction framework for repo-local evidence
+before involving LLMs.
+
+Acceptance notes:
+
+- Add a scanner module and CLI surface that accepts repo path, repo slug,
+  commit, scan profile, and dry-run/output options.
+- Extract systematic evidence from known file families such as package
+  manifests, lockfiles, Docker/Compose files, OpenAPI/AsyncAPI files, Score
+  files, Helm/Kubernetes manifests, service config files, existing `fabric/`
+  declarations, README/INTENT/SCOPE metadata, and repository metadata.
+- Emit normalized candidate entities with source anchors and extractor ids.
+- Keep extractor rules independently testable with fixtures.
+- Avoid network access by default unless a connector profile explicitly enables
+  registry/catalog lookups.
+
+### T03 - LLM-Assisted Extraction Through llm-connect
+
+```task
+id: RAIL-FAB-WP-0010-T03
+status: todo
+priority: high
+```
+
+Add LLM-assisted extraction for ambiguous repo evidence using `llm-connect`
+with structured, auditable outputs.
+
+Acceptance notes:
+
+- Integrate `llm-connect` through a small adapter boundary using
+  `create_adapter`, `RunConfig`, provider/model configuration, and
+  `MockLLMAdapter` tests.
+- Use deterministic preselection to choose small evidence bundles for the LLM;
+  do not send whole repos blindly.
+- Require structured JSON output with candidate nodes, edges, attributes,
+  evidence references, confidence, uncertainty, and rationale.
+- Validate LLM output against the discovery schema before it can enter a
+  snapshot.
+- Record prompt version, model, provider, usage metadata, and extraction run id.
+- Fail closed: malformed or low-confidence LLM output becomes a review artifact,
+  not accepted graph data.
+
+### T04 - Reconciliation, Deduplication, And Tombstones
+
+```task
+id: RAIL-FAB-WP-0010-T04
+status: todo
+priority: high
+```
+
+Build the reconciliation engine that merges deterministic, catalog, declaration,
+and LLM candidates into a coherent repo discovery snapshot.
+
+Acceptance notes:
+
+- Match candidates by stable key, source anchor, alias, normalized labels, and
+  relationship fingerprints.
+- Merge attributes with source-aware precedence and conflict reporting.
+- Prevent duplicate nodes/edges from entering accepted scanner output.
+- Produce explicit added/changed/retired/conflicted candidate sets for dry-run
+  review.
+- Retire vanished candidates only inside their extractor replacement scope.
+- Preserve historical tombstones long enough to explain graph drift and avoid
+  immediate re-creation loops.
+
+### T05 - Registry And Catalog Follow-Up Connectors
+
+```task
+id: RAIL-FAB-WP-0010-T05
+status: todo
+priority: medium
+```
+
+Add connector slots for systematic follow-up against registries and catalogs
+where a repo points to more authoritative metadata.
+
+Acceptance notes:
+
+- Define connector interfaces for package registries, container registries,
+  API catalogs, service catalogs, deployment inventories, and existing Fabric
+  registry data.
+- Add at least one offline-safe prototype connector using local registry data or
+  fixtures.
+- Keep connector evidence separate from repo-local evidence in replacement
+  scopes.
+- Represent connector failures, rate limits, and unavailable catalogs without
+  corrupting repo-local scan results.
+- Document when connector-derived facts should be accepted, candidate, or
+  review-only.
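
Reviewer sketch for T05 (not part of the patch): one possible shape for a connector slot, shown only to make the acceptance notes concrete. Every name here (`ConnectorStatus`, `ConnectorResult`, `CatalogConnector`, `FixtureRegistryConnector`, `follow_up`) is hypothetical and does not exist in Railiance Fabric. The sketch assumes connector failures are returned as data rather than raised, and that connector facts are stored under their own scope key so repo-local candidates are never modified.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Protocol


class ConnectorStatus(Enum):
    OK = "ok"
    RATE_LIMITED = "rate_limited"
    UNAVAILABLE = "unavailable"


@dataclass(frozen=True)
class ConnectorResult:
    """Outcome of one lookup; failures are represented as data, not exceptions."""
    connector_id: str
    status: ConnectorStatus
    facts: tuple = ()
    detail: str = ""


class CatalogConnector(Protocol):
    """Hypothetical connector interface; real connectors would add more methods."""
    connector_id: str

    def lookup(self, name: str) -> ConnectorResult: ...


@dataclass
class FixtureRegistryConnector:
    """Offline-safe prototype backed by an in-memory fixture mapping."""
    fixtures: dict
    connector_id: str = "fixture-registry"

    def lookup(self, name: str) -> ConnectorResult:
        entry = self.fixtures.get(name)
        if entry is None:
            return ConnectorResult(
                self.connector_id,
                ConnectorStatus.UNAVAILABLE,
                detail=f"no fixture for {name}",
            )
        return ConnectorResult(self.connector_id, ConnectorStatus.OK, facts=(entry,))


def follow_up(repo_candidates: list, connector: CatalogConnector) -> dict:
    """Collect connector evidence under its own scope; repo-local data is copied, not edited."""
    connector_facts, failures = [], []
    for candidate in repo_candidates:
        result = connector.lookup(candidate["name"])
        if result.status is ConnectorStatus.OK:
            connector_facts.extend(result.facts)
        else:
            failures.append({"name": candidate["name"], "status": result.status.value})
    return {
        "repo_local": list(repo_candidates),
        f"connector:{connector.connector_id}": connector_facts,
        "connector_failures": failures,
    }
```

Keeping the connector scope as a separate key means a rescan of the connector can retire its own facts without touching anything the repo-local extractors produced.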
+
+### T06 - Registry Integration And Dry-Run Review
+
+```task
+id: RAIL-FAB-WP-0010-T06
+status: todo
+priority: high
+```
+
+Integrate discovery snapshots with the Fabric registry so scans can be reviewed,
+accepted, and reflected in graph queries without losing provenance.
+
+Acceptance notes:
+
+- Add storage/API support for latest discovery snapshots per repo/commit/profile.
+- Add dry-run diff output that shows candidate additions, changes, retirements,
+  conflicts, duplicate merges, and confidence changes.
+- Add an accept path that projects accepted discovery output into the combined
+  graph without overwriting repo-owned declarations.
+- Expose discovery provenance and review state through inventory, graph, drift,
+  and graph explorer payloads.
+- Preserve existing registry snapshot replacement semantics for accepted graph
+  exports.
+
+### T07 - Multi-Repo Scan Orchestration
+
+```task
+id: RAIL-FAB-WP-0010-T07
+status: todo
+priority: medium
+```
+
+Add orchestration so the scanner can run across the known local repo manifests
+and improve Fabric coverage over time.
+
+Acceptance notes:
+
+- Extend or complement `registry sync-manifest` with scan/rescan commands.
+- Support repo allowlists, scan profiles, max LLM budget, deterministic-only
+  mode, and dry-run mode.
+- Produce a concise summary across repos: scanned, changed, retired, conflicted,
+  LLM skipped, LLM failed, and accepted.
+- Keep one repo failure from aborting the entire multi-repo run.
+- Record enough metadata for State Hub progress notes and future automation.
+
+### T08 - Tests, Fixtures, Documentation, And Rollout
+
+```task
+id: RAIL-FAB-WP-0010-T08
+status: todo
+priority: medium
+```
+
+Harden the scanner with fixture coverage and clear adoption guidance before
+using it broadly.
+
+Acceptance notes:
+
+- Add fixture repos that cover deterministic-only discovery, LLM-assisted
+  discovery, duplicate prevention, vanished entity retirement, declaration
+  override, and connector evidence.
+- Test rescans across at least three commits/snapshots to prove duplicates are
+  not created and stale entities are retired.
+- Test with `MockLLMAdapter` so CI does not require network or provider keys.
+- Document scanner commands, scan profiles, identity model, review workflow,
+  LLM configuration, and failure modes.
+- Run the first dry-run scan against a small set of local repos and record the
+  resulting implementation backlog.
+
+## Open Questions
+
+- Should discovered entities live only as registry-side candidates, or should
+  accepted discoveries eventually generate repo-owned declaration patches?
+- Which source types should be accepted without human review, and which should
+  always remain candidates?
+- How long should tombstones be retained, and should retention be per repo,
+  extractor, or entity kind?
+- Should LLM extraction be synchronous in CLI scans, queued as background work,
+  or both?
+- What budget and privacy controls are required before sending repo evidence to
+  an external model provider?
+
+## Close Criteria
+
+- A repo can be scanned, rescanned, and diffed without creating duplicate graph
+  entities.
+- Removed repo evidence causes scoped candidate retirements instead of stale
+  active entities.
+- LLM-assisted extraction is optional, schema-validated, provenance-rich, and
+  testable offline through `MockLLMAdapter`.
+- Registry and graph explorer surfaces can show discovered facts with confidence
+  and review status.
+- The first multi-repo dry run produces useful candidate graph improvements and
+  a clear next implementation backlog.
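
Reviewer sketch for the identity and rescan model (not part of the patch): the workplan derives stable keys from repo slug, entity kind, normalized name, and source anchors, and retires vanished candidates only inside their replacement scope. The following is a minimal illustration under assumptions: `normalize_name`, `stable_key`, `Candidate`, and `rescan_diff` are hypothetical names, and the hashing scheme and key length are placeholders that T01 would still need to define.

```python
import hashlib
import re
from dataclasses import dataclass


def normalize_name(name: str) -> str:
    """Lowercase and collapse non-alphanumeric runs into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def stable_key(repo_slug: str, kind: str, name: str, anchor: str = "") -> str:
    """Derive a deterministic key from repo slug, kind, normalized name, and anchor."""
    raw = "|".join([repo_slug, kind, normalize_name(name), anchor])
    # Truncated SHA-256 is an assumption; any deterministic scheme would do.
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


@dataclass(frozen=True)
class Candidate:
    key: str
    scope: str          # replacement scope, e.g. the extractor that owns the candidate
    attributes: tuple   # (name, value) pairs, kept hashable for comparison


def rescan_diff(previous: list, current: list, scope: str) -> dict:
    """Compare two scans within one replacement scope; other scopes are left alone."""
    prev = {c.key: c for c in previous if c.scope == scope}
    curr = {c.key: c for c in current if c.scope == scope}
    return {
        "added": sorted(curr.keys() - prev.keys()),
        "changed": sorted(k for k in curr.keys() & prev.keys() if curr[k] != prev[k]),
        "retired": sorted(prev.keys() - curr.keys()),
    }
```

Because the diff filters on `scope` first, a candidate produced by an LLM extractor is not retired merely because a manifest-only rescan did not observe it, which matches the scoped retirement rule in T04.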