Add repo reality scanner workplan

2026-05-19 02:26:02 +02:00
parent edf2062a7a
commit 5523b8f493
1 changed file with 334 additions and 0 deletions
--- a/workplans/RAIL-FAB-WP-0010-repo-reality-scanner.md
+++ b/workplans/RAIL-FAB-WP-0010-repo-reality-scanner.md
@@ -0,0 +1,334 @@
+---
+id: RAIL-FAB-WP-0010
+type: workplan
+title: "Repo Reality Scanner"
+domain: railiance
+repo: railiance-fabric
+status: proposed
+owner: codex
+topic_slug: railiance
+planning_priority: high
+planning_order: 10
+created: "2026-05-19"
+updated: "2026-05-19"
+---
+
+# RAIL-FAB-WP-0010 - Repo Reality Scanner
+
+## Goal
+
+Create a facility that scans or rescans repositories and turns observed repo
+reality into Fabric graph entities: nodes, edges, and attributes. The scanner
+should combine deterministic extraction rules with LLM-assisted interpretation
+through `llm-connect`, so the Fabric graph can become more accurate over time
+without requiring every repository to hand-author perfect declarations first.
+
+The facility must be safe for repeated rescans. It should avoid duplicate
+entities, update changed entities, and retire entities that are no longer
+observable. The result should be a disciplined discovery pipeline that can
+follow source files, manifests, service catalogs, package registries, API
+contracts, deployment descriptors, and other systematic sources before asking
+an LLM to infer higher-level structure.
+
+## Background
+
+Railiance Fabric currently treats repo-owned declarations as the authoritative
+graph source and the registry as a read/index layer. That remains the desired
+direction for high-confidence data. This workplan adds a discovery layer that
+can propose, maintain, and explain graph facts gathered from available repos.
+
+The scanner should not create untraceable graph magic. Every inferred node,
+edge, and attribute needs source evidence, confidence, extraction method,
+stable identity, and replacement semantics. Deterministic rules should win
+where possible; LLM extraction should fill gaps, classify ambiguous evidence,
+and explain candidate relationships using structured outputs.
+
+`llm-connect` provides a local Python integration surface for this:
+
+- `create_adapter(provider, model=...)`
+- `RunConfig`
+- `MockLLMAdapter` for offline tests
+- provider resolution via TOML configuration
+- typed LLM errors and usage metadata
+
+## Design Principles
+
+- Prefer systematic evidence before LLM inference.
+- Preserve repo-owned declarations as higher-trust data than discovery output.
+- Make stable entity identity explicit and deterministic.
+- Treat rescans as replacement runs over a scoped source set, not as append-only
+  guesses.
+- Keep duplicates from entering the graph by matching stable keys, aliases,
+  source anchors, normalized names, and declared relationships.
+- Track stale or vanished evidence with tombstones or retired candidates rather
+  than silently leaving old entities active.
+- Make every candidate explainable: source path, extractor, prompt version,
+  evidence snippets or structured references, confidence, and review state.
+- Keep LLM prompts small, auditable, testable, and replaceable.
+- Support dry-run diffs before accepting discovery changes into the registry.
+
+## Proposed Architecture
+
+The scanner should produce a `FabricDiscoverySnapshot` for a repository and
+commit. A snapshot is a normalized candidate graph plus evidence metadata:
+
+- candidate nodes
+- candidate edges
+- candidate attributes
+- source anchors
+- extraction provenance
+- identity keys and aliases
+- confidence/review state
+- replacement scope
+- tombstones for previously observed candidates no longer present
+
+Accepted scanner output can then be projected into existing `FabricGraphExport`
+snapshots, graph explorer payloads, and registry drift APIs. Where repo-owned
+declarations already exist, discovered entities should be matched and attached
+as supporting evidence rather than duplicated.
+
+## Identity And Rescan Model
+
+The hardest problem is identity. Each entity needs:
+
+- a canonical stable key derived from repo slug, entity kind, normalized name,
+  and source-specific anchors when available
+- an alias set for names found in code, manifests, package metadata, route
+  names, catalog ids, URLs, container names, deployment names, and service ids
+- evidence fingerprints for source fragments that explain why the entity
+  exists
+- relationship keys based on normalized source id, target id, edge type, and
+  evidence scope
+- a replacement scope that says which extractor owns which candidates for a
+  repo/commit
+
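A minimal sketch of the stable-key rule described above, assuming a `repo:kind:name` layout with an optional hashed anchor suffix. The function names, the separator characters, and the 12-character digest length are illustrative assumptions, not part of the workplan.

```python
import hashlib
import re
import unicodedata
from typing import Optional


def normalize_name(name: str) -> str:
    """Lowercase, drop accents, and collapse non-alphanumeric runs to one hyphen."""
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")


def stable_key(repo_slug: str, kind: str, name: str, anchor: Optional[str] = None) -> str:
    """Build a deterministic key; the anchor suffix is only added when available."""
    key = ":".join(normalize_name(part) for part in (repo_slug, kind, name))
    if anchor:
        # Hypothetical choice: hash the anchor so paths and fragments of any
        # length fit into a short, fixed-width suffix.
        digest = hashlib.sha256(anchor.encode("utf-8")).hexdigest()[:12]
        key = f"{key}@{digest}"
    return key
```

Under these assumptions, `stable_key("railiance-fabric", "service", "Billing API")` and `stable_key("railiance-fabric", "service", "billing_api")` both yield `railiance-fabric:service:billing-api`, so spelling variants of one name collapse to a single key. Whether an anchor should be part of the key or only an alias is a design decision T01 would need to settle, since anchored keys change when a source file moves.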
+On rescan:
+
+- candidates with the same stable key are updated in place
+- candidates matching only by alias/fingerprint are merged conservatively and
+  flagged for review
+- candidates missing from the same replacement scope are retired and
+  tombstoned
+- repo-owned declarations override discovery candidates where they conflict
+- unreviewed LLM-only candidates stay in candidate/review state and are not
+  promoted to accepted, authoritative declarations
+
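The scoped part of the rescan rules above can be sketched as a diff between two candidate sets. This is an illustrative sketch only: the `Candidate` fields and the `rescan` function are hypothetical, and alias/fingerprint merging and declaration overrides are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    key: str      # stable key
    scope: str    # replacement scope that owns this candidate (hypothetical format)
    payload: str  # stand-in for the candidate's attributes


def rescan(previous, current, scanned_scopes):
    """Classify candidates as added, changed, or retired for one rescan."""
    prev = {c.key: c for c in previous}
    curr = {c.key: c for c in current}
    added = sorted(k for k in curr if k not in prev)
    changed = sorted(k for k in curr if k in prev and curr[k] != prev[k])
    # Retire only candidates whose owning scope was actually rescanned;
    # candidates from scopes that did not run are left untouched.
    retired = sorted(
        k for k, c in prev.items() if k not in curr and c.scope in scanned_scopes
    )
    return {"added": added, "changed": changed, "retired": retired}
```

The scope check is what keeps a deterministic-only rescan from retiring candidates that an LLM or connector extractor produced in an earlier run.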
+## Tasks
+
+### T01 - Discovery Contract And Identity Model
+
+```task
+id: RAIL-FAB-WP-0010-T01
+status: todo
+priority: high
+```
+
+Define the discovery snapshot contract, identity keys, alias handling,
+replacement scopes, provenance fields, confidence/review states, and how
+discovered facts map into existing Fabric graph exports.
+
+Acceptance notes:
+
+- Add schema/docs for `FabricDiscoverySnapshot`, candidate nodes, candidate
+  edges, candidate attributes, source anchors, extractor provenance, and
+  tombstones.
+- Define stable-key generation rules for services, libraries, interfaces,
+  dependencies, deployment/runtime entities, registries/catalog entries, and
+  generic discovered entities.
+- Define merge precedence between repo-owned declarations, deterministic
+  discoveries, registry/catalog discoveries, and LLM discoveries.
+- Define how duplicate candidates are detected and how uncertain matches are
+  flagged for human review.
+- Define rescan replacement semantics so stale entities are retired only within
+  the source scope that produced them.
+
+### T02 - Deterministic Repo Extractor Framework
+
+```task
+id: RAIL-FAB-WP-0010-T02
+status: todo
+priority: high
+```
+
+Implement the first deterministic extraction framework for repo-local evidence
+before involving LLMs.
+
+Acceptance notes:
+
+- Add a scanner module and CLI surface that accepts repo path, repo slug,
+  commit, scan profile, and dry-run/output options.
+- Extract systematic evidence from known file families such as package
+  manifests, lockfiles, Docker/Compose files, OpenAPI/AsyncAPI files, Score
+  files, Helm/Kubernetes manifests, service config files, existing `fabric/`
+  declarations, README/INTENT/SCOPE metadata, and repository metadata.
+- Emit normalized candidate entities with source anchors and extractor ids.
+- Keep extractor rules independently testable with fixtures.
+- Avoid network access by default unless a connector profile explicitly enables
+  registry/catalog lookups.
+
+### T03 - LLM-Assisted Extraction Through llm-connect
+
+```task
+id: RAIL-FAB-WP-0010-T03
+status: todo
+priority: high
+```
+
+Add LLM-assisted extraction for ambiguous repo evidence using `llm-connect`
+with structured, auditable outputs.
+
+Acceptance notes:
+
+- Integrate `llm-connect` through a small adapter boundary using
+  `create_adapter`, `RunConfig`, provider/model configuration, and
+  `MockLLMAdapter` tests.
+- Use deterministic preselection to choose small evidence bundles for the LLM;
+  do not send whole repos blindly.
+- Require structured JSON output with candidate nodes, edges, attributes,
+  evidence references, confidence, uncertainty, and rationale.
+- Validate LLM output against the discovery schema before it can enter a
+  snapshot.
+- Record prompt version, model, provider, usage metadata, and extraction run id.
+- Fail closed: malformed or low-confidence LLM output becomes a review artifact,
+  not accepted graph data.
+
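The fail-closed rule above could look roughly like the following triage step. It is a sketch using only the standard library and does not call `llm-connect`; the required field names, the `nodes` key, and the `0.7` threshold are assumptions standing in for the discovery schema that T01 has yet to define.

```python
import json

REQUIRED_NODE_FIELDS = {"kind", "name", "evidence", "confidence"}  # hypothetical schema
MIN_CONFIDENCE = 0.7  # hypothetical threshold


def triage_llm_output(raw: str) -> dict:
    """Split raw model output into usable candidates and review artifacts."""
    result = {"candidates": [], "review": []}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        result["review"].append({"reason": f"malformed JSON: {exc.msg}", "raw": raw})
        return result
    nodes = data.get("nodes") if isinstance(data, dict) else None
    if not isinstance(nodes, list):
        result["review"].append({"reason": "missing 'nodes' list", "raw": raw})
        return result
    for node in nodes:
        if not isinstance(node, dict) or not REQUIRED_NODE_FIELDS.issubset(node):
            result["review"].append({"reason": "missing required fields", "item": node})
        elif not isinstance(node["confidence"], (int, float)):
            result["review"].append({"reason": "invalid confidence", "item": node})
        elif node["confidence"] < MIN_CONFIDENCE:
            result["review"].append({"reason": "low confidence", "item": node})
        elif not node["evidence"]:
            result["review"].append({"reason": "no evidence references", "item": node})
        else:
            result["candidates"].append(node)
    return result
```

Nothing is dropped silently: every rejected item carries a reason, so it can be stored as a review artifact alongside the prompt version and run id.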
+### T04 - Reconciliation, Deduplication, And Tombstones
+
+```task
+id: RAIL-FAB-WP-0010-T04
+status: todo
+priority: high
+```
+
+Build the reconciliation engine that merges deterministic, catalog, declaration,
+and LLM candidates into a coherent repo discovery snapshot.
+
+Acceptance notes:
+
+- Match candidates by stable key, source anchor, alias, normalized labels, and
+  relationship fingerprints.
+- Merge attributes with source-aware precedence and conflict reporting.
+- Prevent duplicate nodes/edges from entering accepted scanner output.
+- Produce explicit added/changed/retired/conflicted candidate sets for dry-run
+  review.
+- Retire vanished candidates only inside their extractor replacement scope.
+- Preserve historical tombstones long enough to explain graph drift and avoid
+  immediate re-creation loops.
+
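Source-aware attribute merging with conflict reporting might be sketched as below. The source labels and their trust order are assumptions for illustration; the actual merge precedence is what T01 is meant to define.

```python
# Hypothetical trust order, highest first.
PRECEDENCE = ["declaration", "deterministic", "catalog", "llm"]


def merge_attributes(observations):
    """Merge (source, attribute, value) tuples; the most trusted source wins."""
    rank = {source: i for i, source in enumerate(PRECEDENCE)}
    merged = {}
    conflicts = []
    # sorted() is stable, so observations from the same source keep input order.
    for source, name, value in sorted(observations, key=lambda o: rank[o[0]]):
        if name not in merged:
            merged[name] = {"value": value, "source": source}
        elif merged[name]["value"] != value:
            # The lower-trust value loses but is reported rather than discarded.
            conflicts.append({
                "attribute": name,
                "kept_source": merged[name]["source"],
                "rejected_source": source,
                "rejected_value": value,
            })
    return merged, conflicts
```

Agreeing observations from a lower-trust source produce no conflict entry here; a fuller version could record them as supporting evidence instead.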
+### T05 - Registry And Catalog Follow-Up Connectors
+
+```task
+id: RAIL-FAB-WP-0010-T05
+status: todo
+priority: medium
+```
+
+Add connector slots for systematic follow-up against registries and catalogs
+where a repo points to more authoritative metadata.
+
+Acceptance notes:
+
+- Define connector interfaces for package registries, container registries,
+  API catalogs, service catalogs, deployment inventories, and existing Fabric
+  registry data.
+- Add at least one offline-safe prototype connector using local registry data or
+  fixtures.
+- Keep connector evidence separate from repo-local evidence in replacement
+  scopes.
+- Represent connector failures, rate limits, and unavailable catalogs without
+  corrupting repo-local scan results.
+- Document when connector-derived facts should be treated as accepted,
+  candidate, or review-only.
+
+### T06 - Registry Integration And Dry-Run Review
+
+```task
+id: RAIL-FAB-WP-0010-T06
+status: todo
+priority: high
+```
+
+Integrate discovery snapshots with the Fabric registry so scans can be reviewed,
+accepted, and reflected in graph queries without losing provenance.
+
+Acceptance notes:
+
+- Add storage/API support for latest discovery snapshots per repo/commit/profile.
+- Add dry-run diff output that shows candidate additions, changes, retirements,
+  conflicts, duplicate merges, and confidence changes.
+- Add an accept path that projects accepted discovery output into the combined
+  graph without overwriting repo-owned declarations.
+- Expose discovery provenance and review state through inventory, graph, drift,
+  and graph explorer payloads.
+- Preserve existing registry snapshot replacement semantics for accepted graph
+  exports.
+
+### T07 - Multi-Repo Scan Orchestration
+
+```task
+id: RAIL-FAB-WP-0010-T07
+status: todo
+priority: medium
+```
+
+Add orchestration so the scanner can run across the known local repo manifests
+and improve Fabric coverage over time.
+
+Acceptance notes:
+
+- Extend or complement `registry sync-manifest` with scan/rescan commands.
+- Support repo allowlists, scan profiles, max LLM budget, deterministic-only
+  mode, and dry-run mode.
+- Produce a concise summary across repos: scanned, changed, retired, conflicted,
+  LLM skipped, LLM failed, and accepted.
+- Keep one repo failure from aborting the entire multi-repo run.
+- Record enough metadata for State Hub progress notes and future automation.
+
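One way to keep a single repo failure from aborting a multi-repo run is to isolate each scan and fold the results into a summary. In this sketch `scan_all` and the `scan_one` callback are hypothetical names; `scan_one` is assumed to return a mapping of counts such as `{"changed": 2, "retired": 1}`.

```python
from collections import Counter


def scan_all(repos, scan_one):
    """Run scan_one per repo, collecting counts and per-repo failures."""
    summary = Counter()
    failures = {}
    for repo in repos:
        try:
            outcome = scan_one(repo)
        except Exception as exc:  # isolate the failure and keep going
            failures[repo] = str(exc)
            summary["failed"] += 1
            continue
        summary["scanned"] += 1
        summary.update(outcome)  # adds the per-repo counts to the totals
    return dict(summary), failures
```

The `failures` mapping keeps the error text per repo, which is the kind of metadata a progress note or a retry job would need.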
+### T08 - Tests, Fixtures, Documentation, And Rollout
+
+```task
+id: RAIL-FAB-WP-0010-T08
+status: todo
+priority: medium
+```
+
+Harden the scanner with fixture coverage and clear adoption guidance before
+using it broadly.
+
+Acceptance notes:
+
+- Add fixture repos that cover deterministic-only discovery, LLM-assisted
+  discovery, duplicate prevention, vanished entity retirement, declaration
+  override, and connector evidence.
+- Test rescans across at least three commits/snapshots to prove duplicates are
+  not created and stale entities are retired.
+- Test with `MockLLMAdapter` so CI does not require network or provider keys.
+- Document scanner commands, scan profiles, identity model, review workflow,
+  LLM configuration, and failure modes.
+- Run the first dry-run scan against a small set of local repos and record the
+  resulting implementation backlog.
+
+## Open Questions
+
+- Should discovered entities live only as registry-side candidates, or should
+  accepted discoveries eventually generate repo-owned declaration patches?
+- Which source types should be accepted without human review, and which should
+  always remain candidates?
+- How long should tombstones be retained, and should retention be per repo,
+  extractor, or entity kind?
+- Should LLM extraction be synchronous in CLI scans, queued as background work,
+  or both?
+- What budget and privacy controls are required before sending repo evidence to
+  an external model provider?
+
+## Close Criteria
+
+- A repo can be scanned, rescanned, and diffed without creating duplicate graph
+  entities.
+- Removed repo evidence causes scoped candidate retirements instead of stale
+  active entities.
+- LLM-assisted extraction is optional, schema-validated, provenance-rich, and
+  testable offline through `MockLLMAdapter`.
+- Registry and graph explorer surfaces can show discovered facts with confidence
+  and review status.
+- The first multi-repo dry run produces useful candidate graph improvements and
+  a clear next implementation backlog.