tegwick a76c6a4aea Add llm-assisted discovery extraction

2026-05-19 04:35:35 +02:00

id: RAIL-FAB-WP-0010
type: workplan
title: Repo Reality Scanner
domain: railiance
repo: railiance-fabric
status: active
owner: codex
topic_slug: railiance
planning_priority: high
planning_order: 10
created: 2026-05-19
updated: 2026-05-19
state_hub_workstream_id: ac88ac5e-7183-47ba-ac05-1a1b44cfd47e

RAIL-FAB-WP-0010 - Repo Reality Scanner

Goal

Create a facility that scans or rescans repositories and turns observed repo reality into Fabric graph entities: nodes, edges, and attributes. The scanner should combine deterministic extraction rules with LLM-assisted interpretation through llm-connect, so the Fabric graph can become more accurate over time without requiring every repository to hand-author perfect declarations first.

The facility must be safe for repeated rescans. It should avoid duplicate entities, update changed entities, and retire entities that are no longer observable. The result should be a disciplined discovery pipeline that can follow source files, manifests, service catalogs, package registries, API contracts, deployment descriptors, and other systematic sources before asking an LLM to infer higher-level structure.

Background

Railiance Fabric currently treats repo-owned declarations as the authoritative graph source and the registry as a read/index layer. That remains the desired direction for high-confidence data. This workplan adds a discovery layer that can propose, maintain, and explain graph facts gathered from available repos.

The scanner should not create untraceable graph magic. Every inferred node, edge, and attribute needs source evidence, confidence, extraction method, stable identity, and replacement semantics. Deterministic rules should win where possible; LLM extraction should fill gaps, classify ambiguous evidence, and explain candidate relationships using structured outputs.

llm-connect provides a local Python integration surface for this:

create_adapter(provider, model=...)
RunConfig
MockLLMAdapter for offline tests
provider resolution via TOML configuration
typed LLM errors and usage metadata

Design Principles

Prefer systematic evidence before LLM inference.
Preserve repo-owned declarations as higher-trust data than discovery output.
Make stable entity identity explicit and deterministic.
Treat rescans as replacement runs over a scoped source set, not as append-only guesses.
Keep duplicates from entering the graph by matching stable keys, aliases, source anchors, normalized names, and declared relationships.
Track stale or vanished evidence with tombstones or retired candidates rather than silently leaving old entities active.
Make every candidate explainable: source path, extractor, prompt version, evidence snippets or structured references, confidence, and review state.
Keep LLM prompts small, auditable, testable, and replaceable.
Support dry-run diffs before accepting discovery changes into the registry.

Proposed Architecture

The scanner should produce a FabricDiscoverySnapshot for a repository and commit. A snapshot is a normalized candidate graph plus evidence metadata:

candidate nodes
candidate edges
candidate attributes
source anchors
extraction provenance
identity keys and aliases
confidence/review state
replacement scope
tombstones for previously observed candidates no longer present
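A minimal sketch of how such a snapshot could be modeled, for illustration only. The field names, the `add_node` helper, and the `review_state` values are assumptions, not the schema that T01 defines.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class SourceAnchor:
    """Where in the repo a piece of evidence was observed."""
    path: str
    line: Optional[int] = None


@dataclass
class CandidateNode:
    stable_key: str
    kind: str
    name: str
    aliases: set[str] = field(default_factory=set)
    anchors: list[SourceAnchor] = field(default_factory=list)
    extractor: str = "unknown"
    confidence: float = 0.0
    review_state: str = "candidate"  # hypothetical state name


@dataclass
class CandidateEdge:
    source_key: str
    target_key: str
    edge_type: str
    extractor: str = "unknown"
    confidence: float = 0.0


@dataclass
class FabricDiscoverySnapshot:
    repo: str
    commit: str
    nodes: dict[str, CandidateNode] = field(default_factory=dict)
    edges: list[CandidateEdge] = field(default_factory=list)
    tombstones: set[str] = field(default_factory=set)

    def add_node(self, node: CandidateNode) -> None:
        """Index nodes by stable key so a repeated observation never duplicates."""
        existing = self.nodes.get(node.stable_key)
        if existing is None:
            self.nodes[node.stable_key] = node
            return
        # Same entity seen again: fold in new aliases and unseen anchors.
        existing.aliases |= node.aliases
        new_anchors = [a for a in node.anchors if a not in existing.anchors]
        existing.anchors.extend(new_anchors)
```

Keying `nodes` by stable key makes the "no duplicates" rule a property of the container rather than something each extractor has to remember.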

Accepted scanner output can then be projected into existing FabricGraphExport snapshots, graph explorer payloads, and registry drift APIs. Where repo-owned declarations already exist, discovered entities should be matched and attached as supporting evidence rather than duplicated.

Identity And Rescan Model

The hardest problem is identity. Each entity needs:

a canonical stable key derived from repo slug, entity kind, normalized name, and source-specific anchors when available
an alias set for names found in code, manifests, package metadata, route names, catalog ids, URLs, container names, deployment names, and service ids
evidence fingerprints for source fragments that explain why the entity exists
relationship keys based on normalized source id, target id, edge type, and evidence scope
a replacement scope that says which extractor owns which candidates for a repo/commit
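One way the canonical key and relationship key could be derived is sketched below. The normalization rule, the separator, the hash truncation, and the function names are all assumptions made for illustration; the actual rules belong to T01.

```python
import hashlib
import re


def normalize_name(name: str) -> str:
    """Lowercase and collapse every run of non-alphanumerics into one hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", name.strip().lower()).strip("-")


def stable_key(repo_slug: str, kind: str, name: str, anchor: str = "") -> str:
    """Deterministic key from repo slug, entity kind, normalized name, and
    an optional source-specific anchor (hypothetical format)."""
    normalized = normalize_name(name)
    parts = [repo_slug, kind, normalized]
    if anchor:
        parts.append(anchor)
    # The unit separator avoids collisions such as ("a", "b-c") vs ("a-b", "c").
    digest = hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()[:12]
    return f"{repo_slug}:{kind}:{normalized}:{digest}"


def relationship_key(source_id: str, target_id: str, edge_type: str,
                     evidence_scope: str) -> str:
    """Directional edge key: swapping source and target yields a different key."""
    raw = "\x1f".join([source_id, edge_type, target_id, evidence_scope])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]
```

Because the key depends only on normalized inputs, spelling variants such as `Billing API` and `billing_api` resolve to the same entity, while the readable prefix keeps keys easy to inspect in diffs.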

On rescan:

candidates with the same stable key are updated in place
candidates matching only by alias/fingerprint are merged conservatively and flagged for review
candidates missing from the same replacement scope become retired/tombstoned
repo-owned declarations override discovery candidates where they conflict
unreviewed LLM-only candidates remain in candidate/review state and are never promoted to accepted authoritative declarations automatically
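The first three rescan rules can be sketched as a three-pass reconciliation over one replacement scope. This is a simplified illustration: candidates are plain dicts, the `state` values and the `rescan` signature are hypothetical, and declaration override and LLM review handling are left out.

```python
from __future__ import annotations


def rescan(previous: dict[str, dict], observed: list[dict], scope: str):
    """Reconcile one extractor scope against the previous graph.

    Each candidate dict carries "key" and "aliases"; previous entries also
    carry "scope" and "state". Returns (graph, report) without mutating input.
    """
    graph = {key: dict(cand) for key, cand in previous.items()}
    report = {"added": [], "updated": [], "merged_for_review": [], "retired": []}
    seen: set[str] = set()

    # Pass 1: exact stable-key matches are updated in place.
    pending = []
    for cand in observed:
        key = cand["key"]
        if key in graph:
            graph[key] = {**cand, "scope": scope, "state": "active"}
            report["updated"].append(key)
            seen.add(key)
        else:
            pending.append(cand)

    # Pass 2: alias-only matches merge conservatively and are flagged for review.
    for cand in pending:
        match = next(
            (k for k, old in graph.items()
             if k not in seen and old["state"] != "retired"
             and old["aliases"] & cand["aliases"]),
            None,
        )
        if match is not None:
            old = graph[match]
            graph[match] = {**old, "aliases": old["aliases"] | cand["aliases"],
                            "state": "review"}
            report["merged_for_review"].append((match, cand["key"]))
            seen.add(match)
        else:
            graph[cand["key"]] = {**cand, "scope": scope, "state": "active"}
            report["added"].append(cand["key"])
            seen.add(cand["key"])

    # Pass 3: anything this scope produced before but did not see now is retired.
    for key, old in graph.items():
        if old["scope"] == scope and key not in seen and old["state"] != "retired":
            old["state"] = "retired"
            report["retired"].append(key)
    return graph, report
```

Running exact-key matching before alias matching keeps an alias hit from stealing an entity that a later candidate would have matched by key. Entities owned by other scopes are never retired here, which is what makes partial rescans safe.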

Tasks

T01 - Discovery Contract And Identity Model

id: RAIL-FAB-WP-0010-T01
status: done
priority: high
state_hub_task_id: "d77423fa-a47f-4246-86bd-ea1ca2d17bc4"

Define the discovery snapshot contract, identity keys, alias handling, replacement scopes, provenance fields, confidence/review states, and how discovered facts map into existing Fabric graph exports.

Acceptance notes:

Add schema/docs for FabricDiscoverySnapshot, candidate nodes, candidate edges, candidate attributes, source anchors, extractor provenance, and tombstones.
Define stable-key generation rules for services, libraries, interfaces, dependencies, deployment/runtime entities, registries/catalog entries, and generic discovered entities.
Define merge precedence between repo-owned declarations, deterministic discoveries, registry/catalog discoveries, and LLM discoveries.
Define how duplicate candidates are detected and how uncertain matches are flagged for human review.
Define rescan replacement semantics so stale entities are retired only within the source scope that produced them.

T02 - Deterministic Repo Extractor Framework

id: RAIL-FAB-WP-0010-T02
status: done
priority: high
state_hub_task_id: "5d2ff304-9c79-4699-bf8c-ed6db3a90d9f"

Implement the first deterministic extraction framework for repo-local evidence before involving LLMs.

Acceptance notes:

Add a scanner module and CLI surface that accepts repo path, repo slug, commit, scan profile, and dry-run/output options.
Extract systematic evidence from known file families such as package manifests, lockfiles, Docker/Compose files, OpenAPI/AsyncAPI files, Score files, Helm/Kubernetes manifests, service config files, existing fabric/ declarations, README/INTENT/SCOPE metadata, and repository metadata.
Emit normalized candidate entities with source anchors and extractor ids.
Keep extractor rules independently testable with fixtures.
Avoid network access by default unless a connector profile explicitly enables registry/catalog lookups.
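As an illustration of what a single deterministic extractor rule might look like, the sketch below reads a `package.json` manifest and emits candidates with source anchors and an extractor id. The candidate layout, the key format, the `depends_on` edge type, and the `EXTRACTOR_ID` value are assumptions; only the `name` and `dependencies` fields of `package.json` are taken as given.

```python
import json
from pathlib import Path

EXTRACTOR_ID = "package-json/v1"  # hypothetical extractor id


def extract_package_json(repo_path, repo_slug):
    """Emit candidate nodes and edges from package.json; no network access."""
    manifest = Path(repo_path) / "package.json"
    if not manifest.is_file():
        return {"nodes": [], "edges": []}

    data = json.loads(manifest.read_text(encoding="utf-8"))
    anchor = {"path": "package.json"}
    name = data.get("name", repo_slug)
    package_key = f"{repo_slug}:library:{name}"

    nodes = [{"key": package_key, "kind": "library", "name": name,
              "extractor": EXTRACTOR_ID, "anchor": anchor}]
    edges = []
    # Sorted iteration keeps output stable across runs, which keeps diffs quiet.
    for dep, version_range in sorted(data.get("dependencies", {}).items()):
        dep_key = f"npm:library:{dep}"
        nodes.append({"key": dep_key, "kind": "library", "name": dep,
                      "extractor": EXTRACTOR_ID, "anchor": anchor})
        edges.append({"source": package_key, "target": dep_key,
                      "type": "depends_on", "extractor": EXTRACTOR_ID,
                      "anchor": anchor,
                      "attrs": {"version_range": version_range}})
    return {"nodes": nodes, "edges": edges}
```

A rule shaped like this is a pure function of the repo contents, so it can be tested against a fixture directory without any scanner infrastructure.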

T03 - LLM-Assisted Extraction Through llm-connect

id: RAIL-FAB-WP-0010-T03
status: done
priority: high
state_hub_task_id: "59c206a3-94b9-4f47-9c4f-75f87aa8f505"

Add LLM-assisted extraction for ambiguous repo evidence using llm-connect with structured, auditable outputs.

Acceptance notes:

Integrate llm-connect through a small adapter boundary using create_adapter, RunConfig, provider/model configuration, and MockLLMAdapter tests.
Use deterministic preselection to choose small evidence bundles for the LLM; do not send whole repos blindly.
Require structured JSON output with candidate nodes, edges, attributes, evidence references, confidence, uncertainty, and rationale.
Validate LLM output against the discovery schema before it can enter a snapshot.
Record prompt version, model, provider, usage metadata, and extraction run id.
Fail closed: malformed or low-confidence LLM output becomes a review artifact, not accepted graph data.
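The fail-closed rule can be sketched independently of any LLM client. The example below takes the raw model response as a string and sorts it into schema-valid candidates and review artifacts. The required fields, the `0.6` threshold, and the function names are assumptions; it deliberately does not call llm-connect, whose exact signatures are not shown in this workplan.

```python
import json

REQUIRED_FIELDS = {"name", "kind", "evidence", "confidence"}  # assumed schema


def triage_llm_output(raw, min_confidence=0.6):
    """Return (candidates, review_artifacts); nothing invalid becomes a candidate."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [], [{"reason": f"malformed JSON: {exc.msg}", "raw": raw}]

    nodes = payload.get("nodes") if isinstance(payload, dict) else None
    if not isinstance(nodes, list):
        return [], [{"reason": "missing 'nodes' list", "raw": raw}]

    candidates, review = [], []
    for node in nodes:
        reason = _reject_reason(node, min_confidence)
        if reason is None:
            # Valid output still enters as a candidate, never as accepted data.
            candidates.append({**node, "review_state": "candidate"})
        else:
            review.append({"reason": reason, "node": node})
    return candidates, review


def _reject_reason(node, min_confidence):
    """Return why a node is rejected, or None if it passes every check."""
    if not isinstance(node, dict):
        return "node is not an object"
    missing = REQUIRED_FIELDS - node.keys()
    if missing:
        return "missing fields: " + ", ".join(sorted(missing))
    conf = node["confidence"]
    if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0 <= conf <= 1:
        return "confidence must be a number in [0, 1]"
    if not node["evidence"]:
        return "no evidence references"
    if conf < min_confidence:
        return f"confidence {conf} below threshold {min_confidence}"
    return None
```

Rejected items keep their reason and original payload, so a reviewer can see what the model proposed without that proposal ever reaching the snapshot.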

T04 - Reconciliation, Deduplication, And Tombstones

id: RAIL-FAB-WP-0010-T04
status: todo
priority: high
state_hub_task_id: "f0844595-23e0-4e7a-bfd9-e0526b8f85b9"

Build the reconciliation engine that merges deterministic, catalog, declaration, and LLM candidates into a coherent repo discovery snapshot.

Acceptance notes:

Match candidates by stable key, source anchor, alias, normalized labels, and relationship fingerprints.
Merge attributes with source-aware precedence and conflict reporting.
Prevent duplicate nodes/edges from entering accepted scanner output.
Produce explicit added/changed/retired/conflicted candidate sets for dry-run review.
Retire vanished candidates only inside their extractor replacement scope.
Preserve historical tombstones long enough to explain graph drift and avoid immediate re-creation loops.
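Source-aware attribute merging with conflict reporting might look like the following sketch. The precedence order shown (declarations, then deterministic, then catalog, then LLM) is an assumption consistent with the design principles; the relative rank of deterministic versus catalog evidence is still to be defined in T01.

```python
# Assumed precedence, highest trust first; the real order is defined by T01.
PRECEDENCE = ["declaration", "deterministic", "catalog", "llm"]


def merge_attributes(claims):
    """Merge (source, attribute, value) claims for one entity.

    Returns (merged, conflicts). The highest-precedence claim wins, and every
    losing claim that disagrees with the winner is reported, not dropped.
    """
    rank = {source: index for index, source in enumerate(PRECEDENCE)}
    merged, conflicts = {}, []
    # sorted() is stable, so equal-rank claims keep their input order.
    for source, attr, value in sorted(claims, key=lambda claim: rank[claim[0]]):
        if attr not in merged:
            merged[attr] = {"value": value, "source": source}
        elif merged[attr]["value"] != value:
            conflicts.append({
                "attr": attr,
                "kept": merged[attr],
                "rejected": {"value": value, "source": source},
            })
    return merged, conflicts
```

For example, if an LLM claim says a service listens on port 8080 while a repo-owned declaration says 8443, the merged attribute carries 8443 with `source` set to `declaration`, and the 8080 claim appears in `conflicts` for dry-run review.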

T05 - Registry And Catalog Follow-Up Connectors

id: RAIL-FAB-WP-0010-T05
status: todo
priority: medium
state_hub_task_id: "d664301d-c531-4cf8-a1dd-cbadda0e0fdb"

Add connector slots for systematic follow-up against registries and catalogs where a repo points to more authoritative metadata.

Acceptance notes:

Define connector interfaces for package registries, container registries, API catalogs, service catalogs, deployment inventories, and existing Fabric registry data.
Add at least one offline-safe prototype connector using local registry data or fixtures.
Keep connector evidence separate from repo-local evidence in replacement scopes.
Represent connector failures, rate limits, and unavailable catalogs without corrupting repo-local scan results.
Document when connector-derived facts should be treated as accepted, candidate, or review-only.
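A possible shape for the connector interface is sketched below, with an offline fixture connector and a follow-up loop that keeps connector evidence in its own replacement scope. All names here (`Connector`, `ConnectorResult`, `FixtureRegistryConnector`, `follow_up`, the status strings) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class ConnectorResult:
    connector_id: str
    status: str  # assumed values: "ok", "unavailable", "rate_limited"
    facts: list = field(default_factory=list)
    detail: str = ""


class Connector(Protocol):
    connector_id: str

    def lookup(self, reference: str) -> ConnectorResult: ...


class FixtureRegistryConnector:
    """Offline-safe prototype backed by an in-memory fixture mapping."""
    connector_id = "fixture-registry"

    def __init__(self, fixtures: dict):
        self._fixtures = fixtures

    def lookup(self, reference: str) -> ConnectorResult:
        entry = self._fixtures.get(reference)
        if entry is None:
            return ConnectorResult(self.connector_id, "unavailable",
                                   detail=f"no fixture for {reference}")
        return ConnectorResult(self.connector_id, "ok",
                               facts=[{"reference": reference, **entry}])


def follow_up(repo_facts, references, connectors):
    """Collect connector facts per scope; failures never touch repo-local facts."""
    scopes = {"repo-local": list(repo_facts)}
    failures = []
    for connector in connectors:
        scope = f"connector:{connector.connector_id}"
        scopes[scope] = []
        for reference in references:
            try:
                result = connector.lookup(reference)
            except Exception as exc:  # a broken connector must not abort the scan
                result = ConnectorResult(connector.connector_id, "unavailable",
                                         detail=str(exc))
            if result.status == "ok":
                scopes[scope].extend(result.facts)
            else:
                failures.append(result)
    return scopes, failures
```

Returning failures as data rather than raising lets the scan report an unavailable catalog alongside an intact repo-local result.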

T06 - Registry Integration And Dry-Run Review

id: RAIL-FAB-WP-0010-T06
status: todo
priority: high
state_hub_task_id: "9a8420f1-0072-4f40-8d0f-775f59cbe772"

Integrate discovery snapshots with the Fabric registry so scans can be reviewed, accepted, and reflected in graph queries without losing provenance.

Acceptance notes:

Add storage/API support for latest discovery snapshots per repo/commit/profile.
Add dry-run diff output that shows candidate additions, changes, retirements, conflicts, duplicate merges, and confidence changes.
Add an accept path that projects accepted discovery output into the combined graph without overwriting repo-owned declarations.
Expose discovery provenance and review state through inventory, graph, drift, and graph explorer payloads.
Preserve existing registry snapshot replacement semantics for accepted graph exports.
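For the dry-run diff, a minimal text rendering could look like this sketch. It compares two snapshots keyed by stable key and reports additions, retirements, attribute changes, and confidence changes. The input layout and the `+`/`-`/`~` line format are assumptions, and conflicts and duplicate merges are omitted for brevity.

```python
def diff_snapshots(old, new):
    """Render a dry-run diff between two snapshots.

    Both arguments map stable key -> {"confidence": float, "attrs": dict}.
    Returns a list of human-readable lines in stable-key order.
    """
    lines = []
    for key in sorted(old.keys() | new.keys()):
        before, after = old.get(key), new.get(key)
        if before is None:
            lines.append(f"+ {key}")
        elif after is None:
            lines.append(f"- {key} (retired)")
        else:
            if before["attrs"] != after["attrs"]:
                lines.append(f"~ {key} attrs changed")
            if before["confidence"] != after["confidence"]:
                lines.append(
                    f"~ {key} confidence "
                    f"{before['confidence']:.2f} -> {after['confidence']:.2f}"
                )
    return lines
```

Sorting by stable key keeps the output deterministic, so two dry runs over the same inputs produce byte-identical diffs.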

T07 - Multi-Repo Scan Orchestration

id: RAIL-FAB-WP-0010-T07
status: todo
priority: medium
state_hub_task_id: "28014246-0a64-4d69-8065-98de881bffb4"

Add orchestration so the scanner can run across the known local repo manifests and improve Fabric coverage over time.

Acceptance notes:

Extend or complement registry sync-manifest with scan/rescan commands.
Support repo allowlists, scan profiles, max LLM budget, deterministic-only mode, and dry-run mode.
Produce a concise summary across repos: scanned, changed, retired, conflicted, LLM skipped, LLM failed, and accepted.
Keep one repo failure from aborting the entire multi-repo run.
Record enough metadata for State Hub progress notes and future automation.
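The failure-isolation and summary requirements can be illustrated with a small orchestration loop. `scan_repo` stands in for whatever per-repo scan entry point T02 provides; its signature, the outcome keys, and the summary field names are assumptions.

```python
def scan_all(repos, scan_repo, allowlist=None, deterministic_only=False):
    """Scan each allowed repo, isolating failures, and return (summary, failures).

    scan_repo is a hypothetical callable:
    scan_repo(repo, deterministic_only=...) -> dict of counts.
    """
    summary = {"scanned": 0, "failed": 0, "changed": 0, "retired": 0,
               "conflicted": 0, "llm_skipped": 0}
    failures = {}
    for repo in repos:
        if allowlist is not None and repo not in allowlist:
            continue
        try:
            outcome = scan_repo(repo, deterministic_only=deterministic_only)
        except Exception as exc:  # one bad repo must not abort the whole run
            summary["failed"] += 1
            failures[repo] = str(exc)
            continue
        summary["scanned"] += 1
        for counter in ("changed", "retired", "conflicted"):
            summary[counter] += outcome.get(counter, 0)
        if deterministic_only:
            summary["llm_skipped"] += 1
    return summary, failures
```

The per-repo `failures` mapping preserves the error text, which gives a State Hub progress note something concrete to record for each repo that did not complete.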

T08 - Tests, Fixtures, Documentation, And Rollout

id: RAIL-FAB-WP-0010-T08
status: todo
priority: medium
state_hub_task_id: "7a5b7dd7-92c6-4ac5-ae4d-6e73f75aac0d"

Harden the scanner with fixture coverage and clear adoption guidance before using it broadly.

Acceptance notes:

Add fixture repos that cover deterministic-only discovery, LLM-assisted discovery, duplicate prevention, vanished entity retirement, declaration override, and connector evidence.
Test rescans across at least three commits/snapshots to prove duplicates are not created and stale entities are retired.
Test with MockLLMAdapter so CI does not require network or provider keys.
Document scanner commands, scan profiles, identity model, review workflow, LLM configuration, and failure modes.
Run the first dry-run scan against a small set of local repos and record the resulting implementation backlog.

Open Questions

Should discovered entities live only as registry-side candidates, or should accepted discoveries eventually generate repo-owned declaration patches?
Which source types should be accepted without human review, and which should always remain candidates?
How long should tombstones be retained, and should retention be per repo, extractor, or entity kind?
Should LLM extraction be synchronous in CLI scans, queued as background work, or both?
What budget and privacy controls are required before sending repo evidence to an external model provider?

Close Criteria

A repo can be scanned, rescanned, and diffed without creating duplicate graph entities.
Removed repo evidence causes scoped candidate retirements instead of stale active entities.
LLM-assisted extraction is optional, schema-validated, provenance-rich, and testable offline through MockLLMAdapter.
Registry and graph explorer surfaces can show discovered facts with confidence and review status.
The first multi-repo dry run produces useful candidate graph improvements and a clear next implementation backlog.
