# Repository Scoping — Architecture v0.1

## 1. Core architectural idea

Use a **pipeline + registry + inspection UI** architecture.

```text
Git Repo
  ↓
Ingestion
  ↓
Analysis Pipeline
  ↓
Candidate Registry Entries
  ↓
Human Review / Approval
  ↓
Searchable Ability Registry
  ↓
Web UI / API / CLI
```

The system should not treat the first analysis pass as ground truth. It produces **reviewable candidates** that only become registry entries after review.

---

## 2. Main components

### 1. Registry Web App

Purpose:

* register repositories
* trigger analysis
* review results
* inspect ability maps
* search repos

This could be a conventional web app with:

```text
Frontend + Backend API + Database
```

---

### 2. Git Ingestion Service

Responsibilities:

* clone/pull repositories
* checkout commit
* store snapshot metadata
* detect repo structure

Outputs:

```yaml
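repo_snapshot:
  repo_id:         # registry identifier of the repository
  commit_hash:     # commit that was checked out
  branch:          # branch the commit was taken from
  file_tree:       # paths present in the snapshot
  metadata_files:  # subset of file_tree holding repository metadata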
```
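
As a rough illustration, the snapshot record could be assembled from an already checked-out working tree as follows. The `RepoSnapshot` class, the `build_snapshot` helper, and the `METADATA_NAMES` set are hypothetical names introduced for this sketch; only the field names come from the outline above.

```python
from dataclasses import dataclass, field
from pathlib import Path

# Assumption: file names treated as repository metadata for this sketch.
METADATA_NAMES = {"README.md", "LICENSE", "package.json", "requirements.txt"}


@dataclass
class RepoSnapshot:
    repo_id: str
    commit_hash: str
    branch: str
    file_tree: list[str] = field(default_factory=list)
    metadata_files: list[str] = field(default_factory=list)


def build_snapshot(repo_id: str, commit_hash: str, branch: str, root: Path) -> RepoSnapshot:
    """Walk a checked-out working tree and record its files (skipping .git)."""
    files = sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and ".git" not in p.relative_to(root).parts
    )
    metadata = [f for f in files if Path(f).name in METADATA_NAMES]
    return RepoSnapshot(repo_id, commit_hash, branch, files, metadata)
```

The commit hash and branch are passed in rather than read from git here, so the sketch stays independent of how the clone step is implemented.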

---

### 3. Repository Analyzer

This is the heart of the system.

Pipeline stages:

```text
Structure Scanner
Documentation Scanner
Interface Scanner
Test Scanner
LLM-Assisted Extractor
Evidence Linker
Confidence Scorer
```

Important: split deterministic scanners from LLM extraction.
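
One way to keep that split explicit is to tag every stage as deterministic or not, so LLM stages can be skipped or re-run on their own. This is only a sketch: `Stage`, `AnalysisContext`, `run_pipeline`, and the two toy stage functions are hypothetical names, not part of any existing codebase.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AnalysisContext:
    """Accumulates results as stages run."""
    facts: list[dict] = field(default_factory=list)       # written by deterministic scanners
    candidates: list[dict] = field(default_factory=list)  # written by LLM extraction


@dataclass
class Stage:
    name: str
    deterministic: bool
    run: Callable[[AnalysisContext], None]


def run_pipeline(stages: list[Stage], ctx: AnalysisContext, use_llm: bool = True) -> list[str]:
    """Run stages in order and return the names of the stages that executed."""
    executed = []
    for stage in stages:
        if not stage.deterministic and not use_llm:
            continue  # LLM stages are optional; scanners always run
        stage.run(ctx)
        executed.append(stage.name)
    return executed


def structure_scanner(ctx: AnalysisContext) -> None:
    # Toy scanner: records one placeholder fact.
    ctx.facts.append({"kind": "structure", "value": "src/ layout"})


def llm_extractor(ctx: AnalysisContext) -> None:
    # Toy extractor: a real one would call a model using ctx.facts as input.
    ctx.candidates.append({"kind": "ability", "based_on_facts": len(ctx.facts)})


STAGES = [
    Stage("Structure Scanner", True, structure_scanner),
    Stage("LLM-Assisted Extractor", False, llm_extractor),
]
```

Running with `use_llm=False` yields facts only, which makes the deterministic half testable without any model access.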

---

### 4. Candidate Registry Store

Stores unapproved results:

```text
candidate abilities
candidate capabilities
candidate features
candidate evidence
source references
confidence scores
```

These are editable.

---

### 5. Curator Review Layer

Allows a human or agent to:

* accept
* reject
* rename
* merge
* relink
* approve

This turns candidates into official registry entries.
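
A minimal sketch of how review actions might be applied to a candidate, covering only `rename`, `approve`, and `reject`. The `Candidate` class, the `review` function, and the three status values are assumptions made for illustration; merge and relink would need access to the surrounding graph and are left out.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    status: str = "candidate"  # assumed states: candidate, approved, rejected


def review(candidate: Candidate, action: str, new_name: Optional[str] = None) -> Candidate:
    """Return a new Candidate reflecting the reviewer's action."""
    if candidate.status != "candidate":
        # Entries that were already decided are not reviewed again in this sketch.
        raise ValueError(f"{candidate.name!r} is already {candidate.status}")
    if action == "rename":
        if not new_name:
            raise ValueError("rename requires new_name")
        return replace(candidate, name=new_name)
    if action == "approve":
        return replace(candidate, status="approved")
    if action == "reject":
        return replace(candidate, status="rejected")
    raise ValueError(f"unsupported action: {action}")
```

Returning a new object instead of mutating in place makes it straightforward to keep an audit trail of who changed what.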

---

### 6. Search / Query Layer

Supports:

```text
natural language search
ability search
capability search
repo search
feature search
evidence search
```

Use both:

* relational filters
* vector/semantic search
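
The combination can be sketched in plain Python: apply exact-match filters first, then rank the survivors by cosine similarity. The `hybrid_search` function, the entry fields, and the two-dimensional toy embeddings are invented for illustration; in a real deployment the filter and the ranking would both run inside the database.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(entries: list[dict], query_vec: list[float],
                  filters: dict, top_k: int = 5) -> list[dict]:
    """Relational-style filtering followed by semantic ranking."""
    matching = [e for e in entries if all(e.get(k) == v for k, v in filters.items())]
    ranked = sorted(matching, key=lambda e: cosine(e["embedding"], query_vec), reverse=True)
    return ranked[:top_k]


# Toy data: embeddings are made up and far smaller than real ones.
ENTRIES = [
    {"name": "Classify Incoming Email", "language": "python", "embedding": [0.9, 0.1]},
    {"name": "Render Invoice PDF", "language": "python", "embedding": [0.1, 0.9]},
    {"name": "Route Support Ticket", "language": "go", "embedding": [0.8, 0.2]},
]
```

Filtering before ranking keeps the semantic step from surfacing entries the user has explicitly excluded.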

---

### 7. Public/Agent API

Expose structured access:

```http
GET /repos
GET /repos/{id}
GET /abilities
GET /capabilities
GET /search?q=...
GET /repos/{id}/ability-map
```

Later this API can be exposed through MCP so that agents can call it as tools.

---

## 3. Suggested storage architecture

Use a hybrid model:

### PostgreSQL

For canonical structured data:

```text
repositories
snapshots
abilities
capabilities
features
evidence
links
analysis_runs
review_status
```

### Vector index

For semantic search over:

```text
README chunks
docs chunks
ability descriptions
capability descriptions
feature descriptions
```

Start simple with `pgvector` inside PostgreSQL.

### Object/file storage

For:

```text
repo snapshots
analysis artifacts
parsed file summaries
exported registry YAML
```

Local filesystem is fine for MVP.

---

## 4. Data model sketch

```text
Repository
  has many Snapshots
  has many AnalysisRuns
  has many RegistryEntries

Snapshot
  commit_hash
  branch
  file_tree
  extracted_documents

AnalysisRun
  snapshot_id
  status
  started_at
  completed_at
  model_used
  analyzer_version

Ability
  repo_id
  name
  description
  confidence
  status

Capability
  repo_id
  ability_id
  name
  description
  inputs
  outputs
  confidence
  status

Feature
  repo_id
  capability_id
  name
  type
  location
  confidence
  status

Evidence
  repo_id
  capability_id
  type
  path
  strength
```
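
Part of this model could be expressed as Python dataclasses, for example. The `id` fields, the field types, and the default `"candidate"` status are assumptions added for the sketch; `Feature` would follow the same pattern and is omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Ability:
    id: str            # assumed primary key, not listed in the sketch above
    repo_id: str
    name: str
    description: str
    confidence: float
    status: str = "candidate"


@dataclass
class Capability:
    id: str            # assumed primary key
    repo_id: str
    ability_id: str
    name: str
    description: str
    inputs: list[str] = field(default_factory=list)
    outputs: list[str] = field(default_factory=list)
    confidence: float = 0.0
    status: str = "candidate"


@dataclass
class Evidence:
    repo_id: str
    capability_id: str
    type: str
    path: str
    strength: float


def evidence_for(capability: Capability, evidence: list[Evidence]) -> list[Evidence]:
    """Return the evidence rows linked to one capability."""
    return [e for e in evidence if e.capability_id == capability.id]
```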

---

## 5. Analysis pipeline

### Step 1 — Clone / update repo

```text
git clone
checkout commit
record commit hash
```

---

### Step 2 — Deterministic scan

Detect:

```text
languages
frameworks
package managers
entrypoints
routes
CLI commands
tests
docs
examples
config files
```

This stage should be deterministic code, not an LLM call.
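
A small sketch of such a scanner, working purely from the list of file paths. The lookup tables and the `scan_structure` function are hypothetical and deliberately incomplete; a real scanner would cover far more languages, manifests, and layout conventions.

```python
from collections import Counter
from pathlib import PurePosixPath

# Assumption: a tiny illustrative subset of suffix and manifest mappings.
LANGUAGE_BY_SUFFIX = {".py": "Python", ".ts": "TypeScript", ".go": "Go", ".rs": "Rust"}
PACKAGE_MANAGER_BY_FILE = {
    "requirements.txt": "pip",
    "package.json": "npm",
    "Cargo.toml": "cargo",
    "go.mod": "go modules",
}


def scan_structure(file_tree: list[str]) -> dict:
    """Derive structural facts from file paths alone, with no model involved."""
    languages: Counter = Counter()
    package_managers: set[str] = set()
    tests: list[str] = []
    docs: list[str] = []
    for path in file_tree:
        p = PurePosixPath(path)
        language = LANGUAGE_BY_SUFFIX.get(p.suffix)
        if language:
            languages[language] += 1
        if p.name in PACKAGE_MANAGER_BY_FILE:
            package_managers.add(PACKAGE_MANAGER_BY_FILE[p.name])
        if "tests" in p.parts or p.name.startswith("test_"):
            tests.append(path)
        if p.suffix == ".md" or "docs" in p.parts:
            docs.append(path)
    return {
        "languages": dict(languages),
        "package_managers": sorted(package_managers),
        "tests": tests,
        "docs": docs,
    }
```

Because the same input always gives the same output, results from this stage can be stored as observed facts without review.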

---

### Step 3 — Content chunking

Create chunks from:

```text
README
docs
examples
tests
API specs
selected source files
```

Each chunk keeps:

```text
file path
line range
content type
semantic role
```
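
A minimal line-based chunker that preserves this metadata might look like the following. The `Chunk` class, the `chunk_text` function, and the fixed `max_lines` window are assumptions for the sketch; splitting on headings or function boundaries would likely give better chunks.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    file_path: str
    start_line: int      # 1-indexed, inclusive
    end_line: int        # 1-indexed, inclusive
    content_type: str    # e.g. "markdown", "python"
    semantic_role: str   # e.g. "readme", "test", "example"
    text: str


def chunk_text(file_path: str, text: str, content_type: str,
               semantic_role: str, max_lines: int = 40) -> list[Chunk]:
    """Split text into fixed-size line windows, keeping source line ranges."""
    lines = text.splitlines()
    chunks = []
    for start in range(0, len(lines), max_lines):
        block = lines[start:start + max_lines]
        chunks.append(Chunk(
            file_path=file_path,
            start_line=start + 1,
            end_line=start + len(block),
            content_type=content_type,
            semantic_role=semantic_role,
            text="\n".join(block),
        ))
    return chunks
```

Keeping the line range on every chunk is what later allows evidence to point back to an exact location in the repository.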

---

### Step 4 — LLM-assisted extraction

Ask the model separately for:

```text
candidate abilities
candidate capabilities
candidate features
evidence mappings
```

Do not use one giant prompt to do everything.
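
The following sketch shows one prompt per entity type behind a provider-abstracted interface. `LLMProvider`, `extract_candidates`, the prompt wording, and `FakeProvider` are all hypothetical; no real model client is called, and the expectation that responses are JSON arrays is an assumption.

```python
import json
from typing import Protocol


class LLMProvider(Protocol):
    """Anything with a complete() method can act as a provider."""
    def complete(self, prompt: str) -> str: ...


# One narrowly scoped prompt per entity type (wording is illustrative only).
PROMPTS = {
    "abilities": "Task: abilities\nReturn a JSON array of ability names.\n\n{context}",
    "capabilities": "Task: capabilities\nReturn a JSON array of capability names.\n\n{context}",
    "features": "Task: features\nReturn a JSON array of feature names.\n\n{context}",
    "evidence": "Task: evidence\nReturn a JSON array of evidence file paths.\n\n{context}",
}


def extract_candidates(provider: LLMProvider, context: str) -> dict[str, list]:
    """Call the provider once per entity type; unparseable output becomes []."""
    results: dict[str, list] = {}
    for kind, template in PROMPTS.items():
        raw = provider.complete(template.format(context=context))
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            parsed = []
        results[kind] = parsed if isinstance(parsed, list) else []
    return results


class FakeProvider:
    """Stand-in for a real model client: canned replies, records prompts."""
    def __init__(self) -> None:
        self.prompts: list[str] = []

    def complete(self, prompt: str) -> str:
        self.prompts.append(prompt)
        if prompt.startswith("Task: abilities"):
            return '["Classify Incoming Email"]'
        return "not json"
```

Separate calls mean a malformed answer for one entity type does not discard the others, and each prompt can be tuned or versioned independently.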

---

### Step 5 — Confidence scoring

Combine:

```text
LLM confidence
source quality
tests present
examples present
implementation found
multiple-source agreement
```
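
One simple way to combine these signals is a weighted sum over values normalized to the range 0 to 1. The weights below are placeholder assumptions, not tuned values, and `score_confidence` and the signal keys are hypothetical names.

```python
# Assumption: illustrative weights that sum to 1.0; real values need calibration.
WEIGHTS = {
    "llm_confidence": 0.2,
    "source_quality": 0.2,
    "tests_present": 0.2,
    "examples_present": 0.1,
    "implementation_found": 0.2,
    "multi_source_agreement": 0.1,
}


def score_confidence(signals: dict[str, float]) -> float:
    """Weighted sum of signals; missing signals count as 0, values are clamped to [0, 1]."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = min(max(float(signals.get(name, 0.0)), 0.0), 1.0)
        total += weight * value
    return round(total, 3)
```

Keeping the LLM's self-reported confidence as just one weighted input prevents a confident but unsupported claim from scoring highly on its own.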

---

### Step 6 — Candidate graph generation

Output:

```text
Ability
  → Capability
    → Feature
    → Evidence
```

---

### Step 7 — Review

Only approved entries become canonical.

---

## 6. Important design decision

Separate:

```text
Observed facts
```

from:

```text
Interpreted claims
```

Example:

### Observed fact

```yaml
file: src/routes/classify.py
route: POST /classify
```

### Interpreted claim

```yaml
capability: Classify Incoming Email
```

The first is source-derived.
The second is inferred.

This distinction is crucial for trust.
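
In code, the separation could be enforced by giving facts and claims different types. In this sketch, `ObservedFact` and `InterpretedClaim` are hypothetical names, and the rule that every claim must cite at least one fact is an added assumption rather than something stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObservedFact:
    """Something read directly from the repository by a scanner."""
    file: str
    kind: str
    value: str
    source: str = "scanner"


@dataclass(frozen=True)
class InterpretedClaim:
    """An inference about the repository, backed by observed facts."""
    capability: str
    supported_by: tuple[ObservedFact, ...]
    source: str = "llm"

    def __post_init__(self) -> None:
        # Assumed rule: a claim with no supporting fact is rejected outright.
        if not self.supported_by:
            raise ValueError("an interpreted claim must cite at least one observed fact")


fact = ObservedFact(file="src/routes/classify.py", kind="route", value="POST /classify")
claim = InterpretedClaim(capability="Classify Incoming Email", supported_by=(fact,))
```

With distinct types, a reviewer or a UI can always show which parts of an entry were read from source and which were inferred.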

---

## 7. MVP technology suggestion

A very pragmatic stack:

```text
Backend: Python FastAPI
DB: PostgreSQL + pgvector
Worker: Celery/RQ or simple background jobs
Git analysis: GitPython / subprocess git
Frontend: React / Next.js or simple server-rendered app
LLM extraction: provider-abstracted interface
```

Given the broader agent/tooling context, Python is probably the best fit for the analyzer.

---

## 8. Web UI structure

### Page 1 — Repository List

Shows:

```text
name
description
status
last analyzed
top abilities
```

### Page 2 — Register Repository

Input:

```text
Git URL
branch
access token (optional)
```

### Page 3 — Analysis Run

Shows:

```text
scan progress
detected structure
candidate entries
warnings
```

### Page 4 — Review

Tree view:

```text
Ability
  Capability
    Feature
    Evidence
```

Actions:

```text
approve
edit
reject
merge
relink
```

### Page 5 — Repository Profile

Final inspectable view.

### Page 6 — Search

Natural-language search with filters:

```text
domain
language
framework
capability type
maturity
evidence strength
```

---

## 9. Internal API boundaries

Keep clean module boundaries:

```text
repo_ingestion
repo_scanning
content_indexing
llm_extraction
candidate_graph
review_workflow
registry_query
web_api
```

This prevents the analyzer from becoming a ball of mud.

---

## 10. What to avoid in v0.1

Do not build yet:

```text
continuous GitHub app integration
full static code analysis
full ontology engine
automatic truth claims
complex permission system
benchmark execution
marketplace functionality
```

MVP should prove:

> Can we register repos, extract useful maps, review them, and search them?

---

## 11. Recommended first implementation path

### Milestone 1 — Manual Registry

Create schema + UI where entries can be entered manually.

### Milestone 2 — Deterministic Scanner

Add repo clone + README/docs/tests/interface detection.

### Milestone 3 — LLM Candidate Extraction

Generate candidate ability/capability/feature graph.

### Milestone 4 — Review Workflow

Approve/edit/reject extracted entries.

### Milestone 5 — Search

Add semantic search over approved registry entries.

---

## 12. Architecture principle

> Deterministic scanners establish facts.
> LLMs propose interpretations.
> Humans or trusted agents approve registry truth.

That should be the backbone.

