Repository Scoping — Architecture v0.1

1. Core architectural idea

Use a pipeline + registry + inspection UI architecture.

Git Repo
  ↓
Ingestion
  ↓
Analysis Pipeline
  ↓
Candidate Registry Entries
  ↓
Human Review / Approval
  ↓
Searchable Ability Registry
  ↓
Web UI / API / CLI

The system should not pretend the first analysis is truth. It produces reviewable candidates.


2. Main components

1. Registry Web App

Purpose:

  • register repositories
  • trigger analysis
  • review results
  • inspect ability maps
  • search repos

This could be a conventional web app with:

Frontend + Backend API + Database

2. Git Ingestion Service

Responsibilities:

  • clone/pull repositories
  • checkout commit
  • store snapshot metadata
  • detect repo structure

Outputs:

repo_snapshot:
  repo_id
  commit_hash
  branch
  file_tree
  metadata_files
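
The snapshot record above could be sketched as a small dataclass plus a tree walk. This is a minimal illustration, not a fixed design: `RepoSnapshot`, `build_snapshot`, and the `METADATA_NAMES` set are hypothetical names, and the choice of which files count as metadata is an assumption.

```python
from dataclasses import dataclass, field
from pathlib import Path

# Assumption: file names treated as repository metadata (not from the original sketch).
METADATA_NAMES = {"README.md", "pyproject.toml", "package.json", "LICENSE"}


@dataclass
class RepoSnapshot:
    repo_id: str
    commit_hash: str
    branch: str
    file_tree: list[str] = field(default_factory=list)
    metadata_files: list[str] = field(default_factory=list)


def build_snapshot(repo_id: str, root: Path, commit_hash: str, branch: str) -> RepoSnapshot:
    """Walk a checked-out working tree and record its files, skipping .git."""
    tree = sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and ".git" not in p.relative_to(root).parts
    )
    metadata = [path for path in tree if Path(path).name in METADATA_NAMES]
    return RepoSnapshot(repo_id, commit_hash, branch, tree, metadata)
```

Storing paths relative to the repository root keeps snapshots comparable across machines and checkout locations.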

3. Repository Analyzer

This is the heart of the system.

Pipeline stages:

Structure Scanner
Documentation Scanner
Interface Scanner
Test Scanner
LLM-Assisted Extractor
Evidence Linker
Confidence Scorer

Important: split deterministic scanners from LLM extraction.


4. Candidate Registry Store

Stores unapproved results:

candidate abilities
candidate capabilities
candidate features
candidate evidence
source references
confidence scores

Candidate entries remain editable.


5. Curator Review Layer

Allows a human or agent to:

  • accept
  • reject
  • rename
  • merge
  • relink
  • approve

This turns candidates into official registry entries.
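
A few of the review actions above can be sketched as pure functions over an immutable candidate record. Everything here is an assumption made for illustration: the `Candidate` fields, the three status values, and the merge rule (keep one entry, union the evidence, reject the duplicate).

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Candidate:
    id: str
    name: str
    evidence: tuple[str, ...] = ()
    status: str = "candidate"  # assumed values: candidate | approved | rejected


def rename(c: Candidate, new_name: str) -> Candidate:
    return replace(c, name=new_name)


def reject(c: Candidate) -> Candidate:
    return replace(c, status="rejected")


def approve(c: Candidate) -> Candidate:
    # Assumption: a rejected entry cannot be approved directly.
    if c.status == "rejected":
        raise ValueError(f"cannot approve rejected candidate {c.id}")
    return replace(c, status="approved")


def merge(keep: Candidate, duplicate: Candidate) -> tuple[Candidate, Candidate]:
    """Fold the duplicate's evidence into the kept entry and reject the duplicate."""
    combined = tuple(dict.fromkeys(keep.evidence + duplicate.evidence))
    return replace(keep, evidence=combined), reject(duplicate)
```

Returning new records instead of mutating in place makes it straightforward to log every review action as a before/after pair.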


6. Search / Query Layer

Supports:

natural language search
ability search
capability search
repo search
feature search
evidence search

Use both:

  • relational filters
  • vector/semantic search

7. Public/Agent API

Expose structured access:

GET /repos
GET /repos/{id}
GET /abilities
GET /capabilities
GET /search?q=...
GET /repos/{id}/ability-map

Later this can be made MCP-friendly, so that agents can reach the same data through the Model Context Protocol.


3. Suggested storage architecture

Use a hybrid model:

PostgreSQL

For canonical structured data:

repositories
snapshots
abilities
capabilities
features
evidence
links
analysis_runs
review_status

Vector index

For semantic search over:

README chunks
docs chunks
ability descriptions
capability descriptions
feature descriptions

Start simple with pgvector inside PostgreSQL.

Object/file storage

For:

repo snapshots
analysis artifacts
parsed file summaries
exported registry YAML

Local filesystem is fine for MVP.


4. Data model sketch

Repository
  has many Snapshots
  has many AnalysisRuns
  has many RegistryEntries

Snapshot
  commit_hash
  branch
  file_tree
  extracted_documents

AnalysisRun
  snapshot_id
  status
  started_at
  completed_at
  model_used
  analyzer_version

Ability
  repo_id
  name
  description
  confidence
  status

Capability
  repo_id
  ability_id
  name
  description
  inputs
  outputs
  confidence
  status

Feature
  repo_id
  capability_id
  name
  type
  location
  confidence
  status

Evidence
  repo_id
  capability_id
  type
  path
  strength
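
Part of this model can be tried out quickly with an in-memory database. The sketch below uses SQLite from the Python standard library purely as a stand-in for PostgreSQL, covers only three of the entities, and assumes column types (for example a numeric `strength`) that the sketch above does not specify. `open_registry` and `evidence_for_ability` are hypothetical helper names.

```python
import sqlite3

SCHEMA = """
CREATE TABLE ability (
    id INTEGER PRIMARY KEY,
    repo_id TEXT NOT NULL,
    name TEXT NOT NULL,
    confidence REAL,
    status TEXT NOT NULL DEFAULT 'candidate'
);
CREATE TABLE capability (
    id INTEGER PRIMARY KEY,
    repo_id TEXT NOT NULL,
    ability_id INTEGER NOT NULL REFERENCES ability(id),
    name TEXT NOT NULL,
    confidence REAL,
    status TEXT NOT NULL DEFAULT 'candidate'
);
CREATE TABLE evidence (
    id INTEGER PRIMARY KEY,
    repo_id TEXT NOT NULL,
    capability_id INTEGER NOT NULL REFERENCES capability(id),
    type TEXT NOT NULL,
    path TEXT NOT NULL,
    strength REAL
);
"""


def open_registry() -> sqlite3.Connection:
    """Create an in-memory registry with foreign keys enforced."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def evidence_for_ability(conn: sqlite3.Connection, ability_id: int) -> list[tuple[str, str]]:
    """Return (capability name, evidence path) pairs for one ability."""
    rows = conn.execute(
        """
        SELECT c.name, e.path
        FROM capability AS c
        JOIN evidence AS e ON e.capability_id = c.id
        WHERE c.ability_id = ?
        ORDER BY e.path
        """,
        (ability_id,),
    )
    return rows.fetchall()
```

Enforcing the foreign keys means a capability cannot point at an ability that does not exist, which matters once review actions such as merge and relink start moving rows around.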

5. Analysis pipeline

Step 1 — Clone / update repo

git clone
checkout commit
record commit hash
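
These three steps map onto plain `git` invocations. The sketch below takes the subprocess route and assumes `git` is on the PATH; `clone_commands` and `ingest` are hypothetical helper names. Keeping command construction separate from execution lets the commands be checked without network access.

```python
import subprocess
from pathlib import Path


def clone_commands(url: str, dest: Path, ref: str) -> list[list[str]]:
    """Build the git commands that clone a repository and check out one ref."""
    return [
        ["git", "clone", "--quiet", url, str(dest)],
        ["git", "-C", str(dest), "checkout", "--quiet", ref],
    ]


def ingest(url: str, dest: Path, ref: str) -> str:
    """Clone, check out the ref, and return the resolved commit hash."""
    for cmd in clone_commands(url, dest, ref):
        subprocess.run(cmd, check=True)
    result = subprocess.run(
        ["git", "-C", str(dest), "rev-parse", "HEAD"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

Recording the hash returned by `rev-parse` rather than the requested ref pins the snapshot to an exact commit even when the ref is a moving branch name.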

Step 2 — Deterministic scan

Detect:

languages
frameworks
package managers
entrypoints
routes
CLI commands
tests
docs
examples
config files

This should be deterministic code, not an LLM call.
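
A deterministic scan can start as simple lookups over the file tree. The sketch below covers only a few of the items listed above, and its lookup tables and test-file heuristics are illustrative assumptions rather than a complete rule set.

```python
from collections import Counter
from pathlib import PurePosixPath

# Assumed, deliberately small lookup tables.
LANGUAGE_BY_SUFFIX = {".py": "Python", ".ts": "TypeScript", ".go": "Go", ".rs": "Rust"}
PACKAGE_MANAGER_BY_FILE = {
    "requirements.txt": "pip",
    "package.json": "npm",
    "Cargo.toml": "cargo",
    "go.mod": "go modules",
}


def scan(file_tree: list[str]) -> dict:
    """Derive structural facts from repo-relative paths, without reading file contents."""
    paths = [PurePosixPath(p) for p in file_tree]
    languages = Counter(
        LANGUAGE_BY_SUFFIX[p.suffix] for p in paths if p.suffix in LANGUAGE_BY_SUFFIX
    )
    managers = {
        PACKAGE_MANAGER_BY_FILE[p.name] for p in paths if p.name in PACKAGE_MANAGER_BY_FILE
    }
    return {
        "languages": dict(languages),
        "package_managers": sorted(managers),
        "tests": [str(p) for p in paths if "tests" in p.parts or p.name.startswith("test_")],
        "docs": [str(p) for p in paths if p.suffix == ".md"],
    }
```

The same input always yields the same output, so scan results can be cached per commit hash and compared between analyzer versions.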


Step 3 — Content chunking

Create chunks from:

README
docs
examples
tests
API specs
selected source files

Each chunk keeps:

file path
line range
content type
semantic role
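
A fixed-size line chunker is enough to show the metadata each chunk carries. This is a minimal sketch: the `Chunk` field names, the 1-based inclusive line ranges, and the default of 40 lines per chunk are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    file_path: str
    line_start: int  # 1-based, inclusive
    line_end: int    # 1-based, inclusive
    content_type: str
    semantic_role: str
    text: str


def chunk_lines(file_path: str, text: str, content_type: str,
                semantic_role: str, max_lines: int = 40) -> list[Chunk]:
    """Split text into consecutive chunks of at most max_lines lines."""
    lines = text.splitlines()
    chunks = []
    for start in range(0, len(lines), max_lines):
        block = lines[start:start + max_lines]
        chunks.append(
            Chunk(file_path, start + 1, start + len(block),
                  content_type, semantic_role, "\n".join(block))
        )
    return chunks
```

Keeping the line range on every chunk is what later allows a piece of evidence to point back at an exact location in the source file.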

Step 4 — LLM-assisted extraction

Ask the model separately for:

candidate abilities
candidate capabilities
candidate features
evidence mappings

Do not ask one giant prompt to do everything.
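
One way to keep the requests separate is a small task table and one model call per task, behind a provider-neutral interface. The sketch below is hypothetical throughout: `LLMClient`, the prompt wording, the JSON-array reply format, and `FakeClient` (a stand-in that returns a canned reply) are all assumptions, not a real provider API.

```python
import json
from typing import Protocol


class LLMClient(Protocol):
    def complete(self, prompt: str) -> str: ...


# One narrow instruction per extraction target (wording is illustrative).
TASKS = {
    "abilities": "List the high-level abilities this repository provides.",
    "capabilities": "List the concrete capabilities the repository implements.",
    "features": "List the code-level features, such as routes or commands.",
    "evidence": "List the file paths that support the items found so far.",
}


def extract_candidates(client: LLMClient, context: str) -> dict[str, list[str]]:
    """Issue one focused request per task and parse each reply as a JSON array."""
    results = {}
    for task, instruction in TASKS.items():
        prompt = f"{instruction}\nAnswer with a JSON array of strings.\n\n{context}"
        results[task] = json.loads(client.complete(prompt))
    return results


class FakeClient:
    """Stand-in model that records prompts and returns a fixed reply."""

    def __init__(self) -> None:
        self.prompts: list[str] = []

    def complete(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return '["placeholder"]'
```

Separate calls make each reply easier to validate and retry on its own, and a failure in one extraction target does not discard the others.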


Step 5 — Confidence scoring

Combine:

LLM confidence
source quality
tests present
examples present
implementation found
multiple-source agreement
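
The simplest way to combine these signals is a weighted sum. The weights below are placeholders chosen only so that they add up to 1.0, and the signal names are hypothetical keys mirroring the list above; real values would need tuning against reviewed entries.

```python
# Placeholder weights (assumption); they sum to 1.0.
WEIGHTS = {
    "llm_confidence": 0.25,
    "source_quality": 0.15,
    "tests_present": 0.20,
    "examples_present": 0.10,
    "implementation_found": 0.20,
    "multi_source_agreement": 0.10,
}


def score(signals: dict[str, float]) -> float:
    """Weighted sum of signals, each clamped to [0, 1]; missing signals count as 0."""
    total = sum(
        weight * min(max(signals.get(name, 0.0), 0.0), 1.0)
        for name, weight in WEIGHTS.items()
    )
    return round(total, 3)
```

Because the LLM's own confidence carries only a quarter of the weight here, a claim with no tests and no located implementation cannot score highly on model confidence alone.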

Step 6 — Candidate graph generation

Output:

Ability
  → Capability
    → Feature
    → Evidence
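
Given flat candidate records with parent ids, the nested output can be assembled in one pass per level. The record shapes (dicts with `id`, `ability_id`, `capability_id`, `name`, `path`) are assumptions that loosely follow the data model sketch.

```python
from collections import defaultdict


def build_graph(abilities: list[dict], capabilities: list[dict],
                features: list[dict], evidence: list[dict]) -> list[dict]:
    """Nest flat records into Ability -> Capability -> (Feature, Evidence)."""
    features_by_cap = defaultdict(list)
    for feature in features:
        features_by_cap[feature["capability_id"]].append(feature["name"])

    evidence_by_cap = defaultdict(list)
    for item in evidence:
        evidence_by_cap[item["capability_id"]].append(item["path"])

    caps_by_ability = defaultdict(list)
    for cap in capabilities:
        caps_by_ability[cap["ability_id"]].append({
            "name": cap["name"],
            "features": features_by_cap[cap["id"]],
            "evidence": evidence_by_cap[cap["id"]],
        })

    return [
        {"name": ability["name"], "capabilities": caps_by_ability[ability["id"]]}
        for ability in abilities
    ]
```

A capability with an empty `evidence` list is easy to spot in this shape, which is useful for flagging weak candidates before review.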

Step 7 — Review

Only approved entries become canonical.


6. Important design decision

Separate:

Observed facts

from:

Interpreted claims

Example:

Observed fact

file: src/routes/classify.py
route: POST /classify

Interpreted claim

capability: Classify Incoming Email

The first is source-derived. The second is inferred.

This distinction is crucial for trust.
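
In code, the separation can be made explicit by giving facts and claims different types and requiring each claim to reference the facts that support it. The type and field names below are hypothetical, and the `is_grounded` rule (at least one supporting fact) is an assumed minimum bar.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObservedFact:
    file: str
    detail: str  # e.g. "route: POST /classify", produced by a deterministic scanner


@dataclass(frozen=True)
class InterpretedClaim:
    kind: str    # e.g. "capability"
    name: str
    supported_by: tuple[ObservedFact, ...] = ()
    source: str = "llm"  # who proposed the interpretation


def is_grounded(claim: InterpretedClaim) -> bool:
    """A claim counts as grounded only if it cites at least one observed fact."""
    return len(claim.supported_by) > 0


fact = ObservedFact(file="src/routes/classify.py", detail="route: POST /classify")
claim = InterpretedClaim(kind="capability", name="Classify Incoming Email",
                         supported_by=(fact,))
```

With distinct types, the review UI can always show which parts of an entry were read from the source and which were inferred.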


7. MVP technology suggestion

A very pragmatic stack:

Backend: Python FastAPI
DB: PostgreSQL + pgvector
Worker: Celery/RQ or simple background jobs
Git analysis: GitPython / subprocess git
Frontend: React / Next.js or simple server-rendered app
LLM extraction: provider-abstracted interface

Given the broader agent/tooling context, Python is probably the best fit for the analyzer.


8. Web UI structure

Page 1 — Repository List

Shows:

name
description
status
last analyzed
top abilities

Page 2 — Register Repository

Input:

Git URL
branch
access token (optional)

Page 3 — Analysis Run

Shows:

scan progress
detected structure
candidate entries
warnings

Page 4 — Review

Tree view:

Ability
  Capability
    Feature
    Evidence

Actions:

approve
edit
reject
merge
relink

Page 5 — Repository Profile

Final inspectable view.

Page 6 — Search

Natural-language search with filters:

domain
language
framework
capability type
maturity
evidence strength

9. Internal API boundaries

Keep clean module boundaries:

repo_ingestion
repo_scanning
content_indexing
llm_extraction
candidate_graph
review_workflow
registry_query
web_api

This prevents the analyzer from becoming a ball of mud.


10. What to avoid in v0.1

Do not build yet:

continuous GitHub app integration
full static code analysis
full ontology engine
automatic truth claims
complex permission system
benchmark execution
marketplace functionality

MVP should prove:

Can we register repos, extract useful maps, review them, and search them?


11. Milestones

Milestone 1 — Manual Registry

Create schema + UI where entries can be entered manually.

Milestone 2 — Deterministic Scanner

Add repo clone + README/docs/tests/interface detection.

Milestone 3 — LLM Candidate Extraction

Generate candidate ability/capability/feature graph.

Milestone 4 — Review Workflow

Approve/edit/reject extracted entries.

Milestone 5 — Search

Add semantic search over approved registry entries.


12. Architecture principle

Deterministic scanners establish facts. LLMs propose interpretations. Humans or trusted agents approve registry truth.

That should be the backbone.
