repo-scoping/docs/semantic-retrieval.md
2026-04-26 16:05:27 +02:00

Semantic Retrieval Notes

T02 introduces semantic retrieval as an optional layer above the existing SQLite text search. The default service path remains text-only so existing callers keep stable result sets and ordering.

Local provider

HashingEmbeddingProvider is the offline provider used for tests and local development. It produces deterministic token-bucket vectors without any network dependency. Configure it with:

REPO_REGISTRY_EMBEDDING_PROVIDER=hashing
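
As an illustration of what "deterministic token-bucket vectors" can mean, here is a minimal sketch. It is not the actual HashingEmbeddingProvider code: the function names, the 64-dimension default, the tokenizer regex, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
import math
import re


def hashing_embed(text: str, dims: int = 64) -> list[float]:
    """Map text to a fixed-size, L2-normalized token-bucket vector (illustrative)."""
    vec = [0.0] * dims
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # A stable digest keeps bucket assignment identical across processes;
        # Python's built-in hash() is salted per process for strings.
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dims
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    # Empty or token-free input yields the zero vector.
    return [v / norm for v in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Dot product, which equals cosine similarity for normalized inputs."""
    return sum(x * y for x, y in zip(a, b))
```

Because the vector depends only on the input text, repeated runs give identical results, which is what makes a provider of this kind suitable for tests.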

When enabled, search combines:

  • text match score from the existing SQLite search path
  • vector score from approved ability/capability entries and content chunks
  • confidence of the approved entry, used as a small ranking prior
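
The combination above could look roughly like the following sketch. The weights, the `Candidate` type, and the linear form are hypothetical; the notes do not state the actual rank formula, only that confidence contributes a small prior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    entry_id: int
    text_score: float    # assumed normalized to [0, 1] by the text search path
    vector_score: float  # cosine similarity against the query embedding
    confidence: float    # approved confidence, assumed to lie in [0, 1]


# Hypothetical weights. The confidence weight is kept small so that it acts
# as a prior (mostly a tie-breaker) rather than a primary signal.
TEXT_WEIGHT = 0.6
VECTOR_WEIGHT = 0.35
CONFIDENCE_WEIGHT = 0.05


def hybrid_score(c: Candidate) -> float:
    return (TEXT_WEIGHT * c.text_score
            + VECTOR_WEIGHT * c.vector_score
            + CONFIDENCE_WEIGHT * c.confidence)


def rank(candidates: list[Candidate]) -> list[Candidate]:
    # Sort by descending score, then by entry_id so ties stay deterministic.
    return sorted(candidates, key=lambda c: (-hybrid_score(c), c.entry_id))
```

With weights like these, two entries with equal text and vector scores are ordered by confidence, while a clearly stronger text match still outranks a high-confidence weak match.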

PostgreSQL / pgvector path

SQLite dev mode should remain the lowest-friction path. A production PostgreSQL deployment can add pgvector without changing the registry API by introducing an embedding table keyed by source entity:

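-- pgvector must be enabled before the vector column type is available.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE registry_embeddings (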
  id bigserial PRIMARY KEY,
  repository_id bigint NOT NULL,
  source_table text NOT NULL,
  source_id bigint NOT NULL,
  provider text NOT NULL,
  vector vector(768) NOT NULL,
  updated_at timestamptz NOT NULL DEFAULT now(),
  UNIQUE (source_table, source_id, provider)
);

The search service can then replace runtime embedding of stored text with indexed nearest-neighbor lookup, while retaining the current hybrid rank formula and the same response schema.
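
A nearest-neighbor lookup against this table might be issued as in the sketch below. It assumes a DB-API driver that accepts pyformat named placeholders (psycopg does), uses pgvector's `<=>` cosine-distance operator, and invents the helper names `to_vector_literal` and `nearest_query` for illustration. The choice of cosine distance and the default limit of 20 are also assumptions, not something the notes specify.

```python
def to_vector_literal(embedding: list[float]) -> str:
    """Render an embedding in pgvector's text input format, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


# For the ORDER BY to use an index rather than a full scan, the table would
# need something like:
#   CREATE INDEX ON registry_embeddings USING hnsw (vector vector_cosine_ops);
NEAREST_SQL = """
SELECT source_table,
       source_id,
       1 - (vector <=> %(query)s::vector) AS vector_score
FROM registry_embeddings
WHERE repository_id = %(repository_id)s
  AND provider = %(provider)s
ORDER BY vector <=> %(query)s::vector
LIMIT %(limit)s
"""


def nearest_query(embedding: list[float], repository_id: int,
                  provider: str, limit: int = 20) -> tuple[str, dict]:
    """Return the SQL text and parameters; executing them is left to the caller."""
    params = {
        "query": to_vector_literal(embedding),
        "repository_id": repository_id,
        "provider": provider,
        "limit": limit,
    }
    return NEAREST_SQL, params
```

The returned `vector_score` is a cosine similarity, so it can feed the same hybrid ranking step that the runtime-embedding path uses, leaving the response schema untouched.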