# Repository Scoping — Architecture v0.1
## 1. Core architectural idea
Use a **pipeline + registry + inspection UI** architecture.
```text
Git Repo
   ↓
Ingestion
   ↓
Analysis Pipeline
   ↓
Candidate Registry Entries
   ↓
Human Review / Approval
   ↓
Searchable Ability Registry
   ↓
Web UI / API / CLI
```
The system should not pretend the first analysis is truth. It produces **reviewable candidates**.
---
## 2. Main components
### 1. Registry Web App
Purpose:
* register repositories
* trigger analysis
* review results
* inspect ability maps
* search repos
Could be a normal web app with:
```text
Frontend + Backend API + Database
```
---
### 2. Git Ingestion Service
Responsibilities:
* clone/pull repositories
* checkout commit
* store snapshot metadata
* detect repo structure
Outputs:
```yaml
repo_snapshot:
  repo_id:
  commit_hash:
  branch:
  file_tree:
  metadata_files:
```
---
### 3. Repository Analyzer
This is the heart of the system.
Pipeline stages:
```text
Structure Scanner
Documentation Scanner
Interface Scanner
Test Scanner
LLM-Assisted Extractor
Evidence Linker
Confidence Scorer
```
Important: split deterministic scanners from LLM extraction.
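
A minimal sketch of how that split could be made explicit in code, assuming each stage is a plain function over a shared context dict. The `Stage` type, the `deterministic` flag, and both stub stages are hypothetical names used for illustration only. Passing `allow_llm=False` re-runs only the fact-producing scanners.

```python
from dataclasses import dataclass
from typing import Callable

Context = dict  # shared state passed from stage to stage (assumed shape)


@dataclass(frozen=True)
class Stage:
    name: str
    run: Callable[[Context], Context]
    deterministic: bool  # True for scanners, False for LLM-backed stages


def structure_scanner(ctx: Context) -> Context:
    # Placeholder: a real scanner would inspect the file tree in depth.
    return {**ctx, "structure": sorted(ctx["file_tree"])}


def llm_extractor(ctx: Context) -> Context:
    # Placeholder: a real extractor would call a model here.
    return {**ctx, "candidates": ["<model output>"]}


PIPELINE = [
    Stage("structure_scanner", structure_scanner, deterministic=True),
    Stage("llm_extractor", llm_extractor, deterministic=False),
]


def run_pipeline(ctx: Context, stages=PIPELINE, allow_llm: bool = True) -> Context:
    """Run stages in order; optionally skip every non-deterministic stage."""
    for stage in stages:
        if stage.deterministic or allow_llm:
            ctx = stage.run(ctx)
    return ctx
```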
---
### 4. Candidate Registry Store
Stores unapproved results:
```text
candidate abilities
candidate capabilities
candidate features
candidate evidence
source references
confidence scores
```
These are editable.
---
### 5. Curator Review Layer
Allows a human or agent to:
* accept
* reject
* rename
* merge
* relink
* approve
This turns candidates into official registry entries.
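
The review actions can be sketched as a small state machine. This is an illustration under assumptions: the three status values and the rule that edits are only allowed on candidates are not specified above, and because the text does not define how "accept" differs from "approve", only `approve` is modeled.

```python
from enum import Enum


class Status(Enum):
    CANDIDATE = "candidate"
    APPROVED = "approved"
    REJECTED = "rejected"


# Assumed mapping: which actions change status, and from which starting status.
TRANSITIONS = {
    ("approve", Status.CANDIDATE): Status.APPROVED,
    ("reject", Status.CANDIDATE): Status.REJECTED,
}

# Actions that edit a candidate but leave it a candidate (assumption).
EDIT_ACTIONS = {"rename", "merge", "relink"}


def apply_action(status: Status, action: str) -> Status:
    """Return the status after a curator action, or raise if it is not allowed."""
    if action in EDIT_ACTIONS:
        if status is not Status.CANDIDATE:
            raise ValueError(f"cannot {action} an entry that is {status.value}")
        return status
    try:
        return TRANSITIONS[(action, status)]
    except KeyError:
        raise ValueError(f"cannot {action} an entry that is {status.value}") from None
```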
---
### 6. Search / Query Layer
Supports:
```text
natural language search
ability search
capability search
repo search
feature search
evidence search
```
Use both:
* relational filters
* vector/semantic search
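
A minimal in-memory sketch of combining the two: exact-match filters narrow the set, then similarity ranks what is left. In a real deployment the ranking would likely be pushed down to the database; here the function names, the entry shape, and the hand-written 2-d "embeddings" are all assumptions for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(entries, query_vec, limit=5, **filters):
    """Apply exact-match filters first, then rank survivors by similarity."""
    matching = [
        e for e in entries
        if all(e.get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(
        matching, key=lambda e: cosine(e["embedding"], query_vec), reverse=True
    )
    return ranked[:limit]


# Toy data: embeddings are hand-written vectors, not real model output.
ENTRIES = [
    {"name": "Classify Incoming Email", "language": "python", "embedding": [1.0, 0.0]},
    {"name": "Render Invoice PDF", "language": "python", "embedding": [0.0, 1.0]},
    {"name": "Classify Support Ticket", "language": "go", "embedding": [0.9, 0.1]},
]

results = hybrid_search(ENTRIES, [1.0, 0.0], language="python")
```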
---
### 7. Public/Agent API
Expose structured access:
```http
GET /repos
GET /repos/{id}
GET /abilities
GET /capabilities
GET /search?q=...
GET /repos/{id}/ability-map
```
Later, these endpoints can be exposed as MCP (Model Context Protocol) tools.
---
## 3. Suggested storage architecture
Use a hybrid model:
### PostgreSQL
For canonical structured data:
```text
repositories
snapshots
abilities
capabilities
features
evidence
links
analysis_runs
review_status
```
### Vector index
For semantic search over:
```text
README chunks
docs chunks
ability descriptions
capability descriptions
feature descriptions
```
Start simple with `pgvector` inside PostgreSQL.
### Object/file storage
For:
```text
repo snapshots
analysis artifacts
parsed file summaries
exported registry YAML
```
Local filesystem is fine for MVP.
---
## 4. Data model sketch
```text
Repository
  has many Snapshots
  has many AnalysisRuns
  has many RegistryEntries

Snapshot
  commit_hash
  branch
  file_tree
  extracted_documents

AnalysisRun
  snapshot_id
  status
  started_at
  completed_at
  model_used
  analyzer_version

Ability
  repo_id
  name
  description
  confidence
  status

Capability
  repo_id
  ability_id
  name
  description
  inputs
  outputs
  confidence
  status

Feature
  repo_id
  capability_id
  name
  type
  location
  confidence
  status

Evidence
  repo_id
  capability_id
  type
  path
  strength
```
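Part of this model could be expressed as Python dataclasses, sketched below. The field names follow the sketch; the `id` fields, the field types, and the `"candidate"` default status are assumptions added so the foreign keys have something to point at.

```python
from dataclasses import dataclass, field


@dataclass
class Ability:
    id: int
    repo_id: int
    name: str
    description: str = ""
    confidence: float = 0.0
    status: str = "candidate"  # assumed default: nothing starts approved


@dataclass
class Capability:
    id: int
    repo_id: int
    ability_id: int
    name: str
    description: str = ""
    inputs: list[str] = field(default_factory=list)
    outputs: list[str] = field(default_factory=list)
    confidence: float = 0.0
    status: str = "candidate"


@dataclass
class Evidence:
    repo_id: int
    capability_id: int
    type: str
    path: str
    strength: float


def evidence_for(capability: Capability, evidence: list[Evidence]) -> list[Evidence]:
    """Follow the capability_id foreign key from evidence back to a capability."""
    return [e for e in evidence if e.capability_id == capability.id]
```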
---
## 5. Analysis pipeline
### Step 1 — Clone / update repo
```text
git clone
checkout commit
record commit hash
```
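One way to script this step is to shell out to git, as in the sketch below. `clone_commands` and `ingest` are hypothetical helper names; the git subcommands and flags (`clone`, `-C`, `checkout --detach`, `rev-parse HEAD`) are standard. Building the argument lists separately keeps the step testable without network access.

```python
import subprocess
from pathlib import Path


def clone_commands(url: str, dest: Path, commit: str) -> list[list[str]]:
    """Build the git invocations for this step without running them."""
    return [
        ["git", "clone", url, str(dest)],
        ["git", "-C", str(dest), "checkout", "--detach", commit],
        ["git", "-C", str(dest), "rev-parse", "HEAD"],
    ]


def ingest(url: str, dest: Path, commit: str) -> str:
    """Run the commands in order and return the resolved commit hash."""
    output = ""
    for argv in clone_commands(url, dest, commit):
        result = subprocess.run(argv, check=True, capture_output=True, text=True)
        output = result.stdout.strip()
    # The last command is rev-parse, so its stdout is the full commit hash.
    return output
```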
---
### Step 2 — Deterministic scan
Detect:
```text
languages
frameworks
package managers
entrypoints
routes
CLI commands
tests
docs
examples
config files
```
This should be implemented as deterministic code, not as an LLM call.
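
A small sketch of what rule-based detection could look like, covering only a few of the items listed. The lookup tables and the heuristics for tests and docs are illustrative assumptions, far from complete.

```python
from collections import Counter
from pathlib import PurePosixPath

# Small illustrative lookup tables; a real scanner would cover far more.
LANGUAGE_BY_SUFFIX = {".py": "Python", ".ts": "TypeScript", ".go": "Go", ".rs": "Rust"}
PACKAGE_MANAGER_BY_FILE = {
    "pyproject.toml": "pip/poetry",
    "requirements.txt": "pip",
    "package.json": "npm",
    "go.mod": "go modules",
    "Cargo.toml": "cargo",
}


def scan(file_tree: list[str]) -> dict:
    """Classify repo-relative paths using fixed rules only (no model calls)."""
    paths = [PurePosixPath(p) for p in file_tree]
    languages = Counter(
        LANGUAGE_BY_SUFFIX[p.suffix] for p in paths if p.suffix in LANGUAGE_BY_SUFFIX
    )
    managers = {
        PACKAGE_MANAGER_BY_FILE[p.name] for p in paths if p.name in PACKAGE_MANAGER_BY_FILE
    }
    return {
        "languages": dict(languages),
        "package_managers": sorted(managers),
        "tests": [str(p) for p in paths if "tests" in p.parts or p.name.startswith("test_")],
        "docs": [str(p) for p in paths if p.suffix == ".md" or "docs" in p.parts],
    }


report = scan(["README.md", "pyproject.toml", "src/routes/classify.py", "tests/test_classify.py"])
```

Because the same input always yields the same report, the output can be stored as an observed fact alongside the snapshot.
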
---
### Step 3 — Content chunking
Create chunks from:
```text
README
docs
examples
tests
API specs
selected source files
```
Each chunk keeps:
```text
file path
line range
content type
semantic role
```
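A minimal chunker that preserves this metadata might look like the sketch below. Fixed-size line windows are an assumption chosen for simplicity (a real chunker would more likely split on headings or function boundaries), and the `Chunk` type and `chunk_lines` name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    file_path: str
    line_start: int  # 1-based, inclusive
    line_end: int    # 1-based, inclusive
    content_type: str
    semantic_role: str
    text: str


def chunk_lines(file_path, text, content_type, semantic_role, max_lines=40):
    """Split text into fixed-size line windows, keeping each window's line range."""
    lines = text.splitlines()
    chunks = []
    for start in range(0, len(lines), max_lines):
        window = lines[start:start + max_lines]
        chunks.append(
            Chunk(
                file_path=file_path,
                line_start=start + 1,
                line_end=start + len(window),
                content_type=content_type,
                semantic_role=semantic_role,
                text="\n".join(window),
            )
        )
    return chunks
```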
---
### Step 4 — LLM-assisted extraction
Ask the model separately for:
```text
candidate abilities
candidate capabilities
candidate features
evidence mappings
```
Do not ask one giant prompt to do everything.
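
The "one request per target" idea can be sketched behind a provider-abstracted interface. Everything named here (`LLMProvider`, `build_prompt`, `extract_all`, `EchoProvider`) is hypothetical, and the prompt wording is a placeholder rather than a recommended prompt.

```python
from typing import Protocol

TARGETS = ("abilities", "capabilities", "features", "evidence mappings")


class LLMProvider(Protocol):
    def complete(self, prompt: str) -> str: ...


def build_prompt(target: str, chunk_text: str) -> str:
    # Placeholder wording; each target gets its own narrowly scoped prompt.
    return f"List the candidate {target} supported by this excerpt.\n\n{chunk_text}"


def extract_all(provider: LLMProvider, chunk_text: str) -> dict[str, str]:
    """Issue one focused request per target instead of a single combined prompt."""
    return {t: provider.complete(build_prompt(t, chunk_text)) for t in TARGETS}


class EchoProvider:
    """Stand-in for a real model client; records prompts for inspection."""

    def __init__(self):
        self.prompts = []

    def complete(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return f"(response {len(self.prompts)})"
```

Swapping `EchoProvider` for a real client only requires an object with a matching `complete` method.
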
---
### Step 5 — Confidence scoring
Combine:
```text
LLM confidence
source quality
tests present
examples present
implementation found
multiple-source agreement
```
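One simple way to combine these signals is a weighted average, sketched below. The weights are invented for illustration and would need tuning against reviewed data; the signal names are hypothetical keys for the items listed above.

```python
# Assumed weights; they sum to 1.0 so the score stays in [0, 1].
WEIGHTS = {
    "llm_confidence": 0.25,
    "source_quality": 0.15,
    "tests_present": 0.20,
    "examples_present": 0.10,
    "implementation_found": 0.20,
    "multi_source_agreement": 0.10,
}


def confidence(signals: dict) -> float:
    """Weighted average of signals, each clamped to [0, 1]; missing signals count as 0."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = min(1.0, max(0.0, float(signals.get(name, 0.0))))
        total += weight * value
    return round(total, 3)


# Boolean signals work too, since float(True) == 1.0.
score = confidence({"llm_confidence": 0.8, "tests_present": True})
```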
---
### Step 6 — Candidate graph generation
Output:
```text
Ability
  → Capability
    → Feature
      → Evidence
```
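A sketch of assembling that graph from flat rows by following their ids. It assumes rows are plain dicts and attaches both features and evidence to their capability via `capability_id`, which is one possible reading of the hierarchy; `build_graph` is a hypothetical name.

```python
from collections import defaultdict


def build_graph(abilities, capabilities, features, evidence):
    """Nest flat rows into Ability -> Capability -> features/evidence."""
    features_by_cap = defaultdict(list)
    for f in features:
        features_by_cap[f["capability_id"]].append(f["name"])

    evidence_by_cap = defaultdict(list)
    for e in evidence:
        evidence_by_cap[e["capability_id"]].append(e["path"])

    caps_by_ability = defaultdict(list)
    for c in capabilities:
        caps_by_ability[c["ability_id"]].append(
            {
                "name": c["name"],
                "features": features_by_cap[c["id"]],
                "evidence": evidence_by_cap[c["id"]],
            }
        )

    return [
        {"name": a["name"], "capabilities": caps_by_ability[a["id"]]}
        for a in abilities
    ]
```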
---
### Step 7 — Review
Only approved entries become canonical.
---
## 6. Important design decision
Separate:
```text
Observed facts
```
from:
```text
Interpreted claims
```
Example:
### Observed fact
```yaml
file: src/routes/classify.py
route: POST /classify
```
### Interpreted claim
```yaml
capability: Classify Incoming Email
```
The first is source-derived.
The second is inferred.
This distinction is crucial for trust.
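
The separation can also be enforced at the type level, as in this sketch. The class names and the `supported_by` link are assumptions; the point is that a claim carries references to the facts behind it, and a claim with none is easy to flag.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ObservedFact:
    """Something read directly from the repository by a deterministic scanner."""
    file: str
    route: str


@dataclass
class InterpretedClaim:
    """An inference about what the code is for; should cite the facts behind it."""
    capability: str
    supported_by: list[ObservedFact] = field(default_factory=list)

    @property
    def is_grounded(self) -> bool:
        return len(self.supported_by) > 0


fact = ObservedFact(file="src/routes/classify.py", route="POST /classify")
claim = InterpretedClaim(capability="Classify Incoming Email", supported_by=[fact])
```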
---
## 7. MVP technology suggestion
A very pragmatic stack:
```text
Backend: Python FastAPI
DB: PostgreSQL + pgvector
Worker: Celery/RQ or simple background jobs
Git analysis: GitPython / subprocess git
Frontend: React / Next.js or simple server-rendered app
LLM extraction: provider-abstracted interface
```
Given your broader agent/tooling context, Python is probably best for the analyzer.
---
## 8. Web UI structure
### Page 1 — Repository List
Shows:
```text
name
description
status
last analyzed
top abilities
```
### Page 2 — Register Repository
Input:
```text
Git URL
branch
access token (optional)
```
### Page 3 — Analysis Run
Shows:
```text
scan progress
detected structure
candidate entries
warnings
```
### Page 4 — Review
Tree view:
```text
Ability
  Capability
    Feature
      Evidence
```
Actions:
```text
approve
edit
reject
merge
relink
```
### Page 5 — Repository Profile
Final inspectable view.
### Page 6 — Search
Natural-language search with filters:
```text
domain
language
framework
capability type
maturity
evidence strength
```
---
## 9. Internal API boundaries
Keep clean module boundaries:
```text
repo_ingestion
repo_scanning
content_indexing
llm_extraction
candidate_graph
review_workflow
registry_query
web_api
```
This prevents the analyzer from becoming a ball of mud.
---
## 10. What to avoid in v0.1
Do not build yet:
```text
continuous GitHub app integration
full static code analysis
full ontology engine
automatic truth claims
complex permission system
benchmark execution
marketplace functionality
```
MVP should prove:
> Can we register repos, extract useful maps, review them, and search them?
---
## 11. Recommended first implementation path
### Milestone 1 — Manual Registry
Create schema + UI where entries can be entered manually.
### Milestone 2 — Deterministic Scanner
Add repo clone + README/docs/tests/interface detection.
### Milestone 3 — LLM Candidate Extraction
Generate candidate ability/capability/feature graph.
### Milestone 4 — Review Workflow
Approve/edit/reject extracted entries.
### Milestone 5 — Search
Add semantic search over approved registry entries.
---
## 12. Architecture principle
> Deterministic scanners establish facts.
> LLMs propose interpretations.
> Humans or trusted agents approve registry truth.
That should be the backbone.