# Repository Scoping — Architecture v0.1

## 1. Core architectural idea

Use a **pipeline + registry + inspection UI** architecture.

```text
Git Repo
↓
Ingestion
↓
Analysis Pipeline
↓
Candidate Registry Entries
↓
Human Review / Approval
↓
Searchable Ability Registry
↓
Web UI / API / CLI
```

The system should not pretend the first analysis is truth. It produces **reviewable candidates**.

---

## 2. Main components

### 1. Registry Web App

Purpose:

* register repositories
* trigger analysis
* review results
* inspect ability maps
* search repos

Could be a normal web app with:

```text
Frontend + Backend API + Database
```

---

### 2. Git Ingestion Service

Responsibilities:

* clone/pull repositories
* checkout commit
* store snapshot metadata
* detect repo structure

Outputs:

```yaml
repo_snapshot:
  repo_id
  commit_hash
  branch
  file_tree
  metadata_files
```

---

### 3. Repository Analyzer

This is the heart of the system.

Pipeline stages:

```text
Structure Scanner
Documentation Scanner
Interface Scanner
Test Scanner
LLM-Assisted Extractor
Evidence Linker
Confidence Scorer
```

Important: split deterministic scanners from LLM extraction.
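
One way to keep that split visible in code is to hold the two kinds of stage in separate lists and run them in order. The following is a minimal sketch, not a definitive design: `AnalysisContext`, the stage function names, and the "LLM stages may not modify facts" check are all assumptions made for illustration, and the LLM stage is a placeholder that calls no model.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AnalysisContext:
    """Hypothetical container passed through every stage."""
    file_tree: list[str]
    facts: dict = field(default_factory=dict)       # written by deterministic stages
    candidates: list = field(default_factory=list)  # written by LLM stages


def scan_structure(ctx: AnalysisContext) -> None:
    # Deterministic: derive top-level directories from the file tree.
    ctx.facts["top_level_dirs"] = sorted(
        {p.split("/")[0] for p in ctx.file_tree if "/" in p}
    )


def scan_tests(ctx: AnalysisContext) -> None:
    # Deterministic: collect files under tests/.
    ctx.facts["test_files"] = sorted(
        p for p in ctx.file_tree if p.startswith("tests/")
    )


def extract_with_llm(ctx: AnalysisContext) -> None:
    # Placeholder for a model call; it only reads facts and appends a candidate.
    ctx.candidates.append(
        {"kind": "ability", "name": "TODO", "based_on": sorted(ctx.facts)}
    )


DETERMINISTIC_STAGES: list[Callable[[AnalysisContext], None]] = [scan_structure, scan_tests]
LLM_STAGES: list[Callable[[AnalysisContext], None]] = [extract_with_llm]


def run_pipeline(ctx: AnalysisContext) -> AnalysisContext:
    for stage in DETERMINISTIC_STAGES:
        stage(ctx)
    frozen_facts = dict(ctx.facts)
    for stage in LLM_STAGES:
        stage(ctx)
        # Assumed rule: LLM stages propose candidates but never rewrite facts.
        if ctx.facts != frozen_facts:
            raise RuntimeError(f"{stage.__name__} modified deterministic facts")
    return ctx
```

Calling `run_pipeline(AnalysisContext(file_tree=[...]))` returns the same context with `facts` filled by the scanners and `candidates` filled by the extractor.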

---

### 4. Candidate Registry Store

Stores unapproved results:

```text
candidate abilities
candidate capabilities
candidate features
candidate evidence
source references
confidence scores
```

These are editable.

---

### 5. Curator Review Layer

Allows a human or agent to:

* accept
* reject
* rename
* merge
* relink
* approve

This turns candidates into official registry entries.
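
The promotion from candidate to official entry can be modeled as a small state machine. This is a sketch under assumptions: the three status names and the transition table are illustrative (the document does not define them), and entries are plain dictionaries here.

```python
from enum import Enum


class ReviewStatus(Enum):
    CANDIDATE = "candidate"
    APPROVED = "approved"
    REJECTED = "rejected"


# Assumed transition table: approved entries are final, rejected ones can be reopened.
ALLOWED_TRANSITIONS = {
    ReviewStatus.CANDIDATE: {ReviewStatus.APPROVED, ReviewStatus.REJECTED},
    ReviewStatus.REJECTED: {ReviewStatus.CANDIDATE},
    ReviewStatus.APPROVED: set(),
}


def transition(entry: dict, new_status: ReviewStatus, reviewer: str) -> dict:
    """Return a copy of the entry with the new status, or raise if not allowed."""
    current = entry["status"]
    if new_status not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {new_status.value}")
    return {**entry, "status": new_status, "reviewed_by": reviewer}
```

Returning a copy rather than mutating the entry keeps the original candidate available for an audit trail, if one is wanted.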

---

### 6. Search / Query Layer

Supports:

```text
natural language search
ability search
capability search
repo search
feature search
evidence search
```

Use both:

* relational filters
* vector/semantic search
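
To show how the two combine, here is an in-memory sketch: filter first on a structured field, then rank the survivors by cosine similarity to a query vector. The entry layout, the `language` filter, and the three-dimensional toy embeddings are assumptions for illustration only; a real system would obtain embeddings from a model.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(entries, query_vec, *, language=None, top_k=5):
    # Relational step: keep only entries matching the structured filter.
    filtered = [e for e in entries if language is None or e["language"] == language]
    # Semantic step: rank what is left by similarity to the query vector.
    ranked = sorted(filtered, key=lambda e: cosine(e["embedding"], query_vec), reverse=True)
    return ranked[:top_k]


# Toy data with made-up embeddings.
ENTRIES = [
    {"name": "Classify Incoming Email", "language": "python", "embedding": [0.9, 0.1, 0.0]},
    {"name": "Render Invoice PDF", "language": "python", "embedding": [0.0, 0.2, 0.9]},
    {"name": "Route Support Ticket", "language": "go", "embedding": [0.8, 0.3, 0.1]},
]
```

In PostgreSQL the same shape would be a `WHERE` clause for the filter plus an `ORDER BY` on a vector distance for the ranking.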

---

### 7. Public/Agent API

Expose structured access:

```http
GET /repos
GET /repos/{id}
GET /abilities
GET /capabilities
GET /search?q=...
GET /repos/{id}/ability-map
```

Later this becomes MCP-friendly.

---

## 3. Suggested storage architecture

Use a hybrid model:

### PostgreSQL

For canonical structured data:

```text
repositories
snapshots
abilities
capabilities
features
evidence
links
analysis_runs
review_status
```

### Vector index

For semantic search over:

```text
README chunks
docs chunks
ability descriptions
capability descriptions
feature descriptions
```

Start simple with `pgvector` inside PostgreSQL.

### Object/file storage

For:

```text
repo snapshots
analysis artifacts
parsed file summaries
exported registry YAML
```

Local filesystem is fine for MVP.

---

## 4. Data model sketch

```text
Repository
  has many Snapshots
  has many AnalysisRuns
  has many RegistryEntries

Snapshot
  commit_hash
  branch
  file_tree
  extracted_documents

AnalysisRun
  snapshot_id
  status
  started_at
  completed_at
  model_used
  analyzer_version

Ability
  repo_id
  name
  description
  confidence
  status

Capability
  repo_id
  ability_id
  name
  description
  inputs
  outputs
  confidence
  status

Feature
  repo_id
  capability_id
  name
  type
  location
  confidence
  status

Evidence
  repo_id
  capability_id
  type
  path
  strength
```
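
The ability, capability, and evidence entities above translate fairly directly into Python dataclasses. This is a sketch, not a schema decision: the field types, the `id` fields, and the `"candidate"` default for `status` are assumptions that the outline does not specify, and `evidence_for` is a hypothetical helper.

```python
from dataclasses import dataclass, field


@dataclass
class Ability:
    id: str            # assumed primary key, not listed in the outline
    repo_id: str
    name: str
    description: str
    confidence: float
    status: str = "candidate"


@dataclass
class Capability:
    id: str            # assumed primary key, referenced by Evidence.capability_id
    repo_id: str
    ability_id: str
    name: str
    description: str
    inputs: list[str] = field(default_factory=list)
    outputs: list[str] = field(default_factory=list)
    confidence: float = 0.0
    status: str = "candidate"


@dataclass
class Evidence:
    repo_id: str
    capability_id: str
    type: str
    path: str
    strength: float


def evidence_for(capability: Capability, evidence: list[Evidence]) -> list[Evidence]:
    """Return the evidence rows linked to one capability, strongest first."""
    linked = [e for e in evidence if e.capability_id == capability.id]
    return sorted(linked, key=lambda e: e.strength, reverse=True)
```

`Feature` would follow the same pattern as `Evidence`, keyed by `capability_id`.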

---

## 5. Analysis pipeline

### Step 1 — Clone / update repo

```text
git clone
checkout commit
record commit hash
```

---

### Step 2 — Deterministic scan

Detect:

```text
languages
frameworks
package managers
entrypoints
routes
CLI commands
tests
docs
examples
config files
```

This should be deterministic code, not an LLM call.
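
As an illustration of what "deterministic" means here, the sketch below detects languages, package managers, and test files from a list of paths using lookup tables alone. The two tables are deliberately tiny and are assumptions for the example, not a complete mapping; the test-file heuristic is also just one plausible convention.

```python
from collections import Counter
from pathlib import PurePosixPath

# Illustrative, incomplete lookup tables (assumptions for this sketch).
EXTENSION_LANGUAGES = {".py": "Python", ".ts": "TypeScript", ".go": "Go", ".rs": "Rust"}
MANIFEST_PACKAGE_MANAGERS = {
    "requirements.txt": "pip",
    "package.json": "npm",
    "go.mod": "go",
    "Cargo.toml": "cargo",
}


def scan(file_tree: list[str]) -> dict:
    """Derive facts from repository paths; same input always gives same output."""
    languages = Counter()
    package_managers = set()
    tests = []
    for raw in file_tree:
        path = PurePosixPath(raw)
        language = EXTENSION_LANGUAGES.get(path.suffix)
        if language:
            languages[language] += 1
        if path.name in MANIFEST_PACKAGE_MANAGERS:
            package_managers.add(MANIFEST_PACKAGE_MANAGERS[path.name])
        # Assumed heuristic: a tests/ directory or a test_ filename prefix.
        if "tests" in path.parts or path.name.startswith("test_"):
            tests.append(raw)
    return {
        "languages": dict(languages),
        "package_managers": sorted(package_managers),
        "tests": sorted(tests),
    }
```

Because the function depends only on its input, its output can be stored as an observed fact and re-derived at any time for the same snapshot.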

---

### Step 3 — Content chunking

Create chunks from:

```text
README
docs
examples
tests
API specs
selected source files
```

Each chunk keeps:

```text
file path
line range
content type
semantic role
```
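
A fixed-size line chunker is enough to show how those four fields travel with each chunk. This is a minimal sketch: the `Chunk` class, the `chunk_lines` function, and the 40-line default are assumptions, and a real chunker would more likely split on headings or function boundaries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    file_path: str
    start_line: int      # 1-based, inclusive
    end_line: int        # 1-based, inclusive
    content_type: str
    semantic_role: str
    text: str


def chunk_lines(file_path: str, text: str, *, content_type: str,
                semantic_role: str, max_lines: int = 40) -> list[Chunk]:
    """Split text into consecutive blocks of at most max_lines lines."""
    lines = text.splitlines()
    chunks = []
    for start in range(0, len(lines), max_lines):
        block = lines[start:start + max_lines]
        chunks.append(Chunk(
            file_path=file_path,
            start_line=start + 1,
            end_line=start + len(block),
            content_type=content_type,
            semantic_role=semantic_role,
            text="\n".join(block),
        ))
    return chunks
```

Keeping the line range on every chunk is what later lets a piece of evidence point back to an exact location in the repository.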

---

### Step 4 — LLM-assisted extraction

Ask the model separately for:

```text
candidate abilities
candidate capabilities
candidate features
evidence mappings
```

Do not use one giant prompt to do everything.
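
A sketch of what "separately" could look like: one focused prompt per extraction task, sent through a provider-neutral interface. Everything named here is hypothetical: the `LLMClient` protocol, its `complete` method, the prompt wording, and the `EchoClient` stand-in, which calls no real provider.

```python
from typing import Protocol


class LLMClient(Protocol):
    """Assumed provider-neutral interface; any real SDK would be wrapped to fit it."""
    def complete(self, prompt: str) -> str: ...


# One instruction per task; the wording is illustrative only.
EXTRACTION_TASKS = {
    "abilities": "List the high-level abilities this repository provides.",
    "capabilities": "List the concrete capabilities that support each ability.",
    "features": "List the features (routes, commands, functions) that implement them.",
    "evidence": "Map each capability to the files that support it.",
}


def extract_candidates(client: LLMClient, context: str) -> dict[str, str]:
    """Run each extraction task as its own prompt and collect the raw replies."""
    results = {}
    for task, instruction in EXTRACTION_TASKS.items():
        prompt = f"{instruction}\n\nRepository context:\n{context}"
        results[task] = client.complete(prompt)
    return results


class EchoClient:
    """Stand-in for a real provider: records prompts and returns canned replies."""
    def __init__(self) -> None:
        self.prompts: list[str] = []

    def complete(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return f"reply {len(self.prompts)}"
```

Separate prompts also make it possible to re-run or evaluate one task without repeating the others.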

---

### Step 5 — Confidence scoring

Combine:

```text
LLM confidence
source quality
tests present
examples present
implementation found
multiple-source agreement
```
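
One simple way to combine these signals is a weighted sum over values normalized to the range 0 to 1. The weights below are placeholders chosen so that they sum to 1.0; they are assumptions for the sketch, not calibrated values, and the signal key names are likewise illustrative.

```python
# Placeholder weights (assumed, uncalibrated); they sum to 1.0.
WEIGHTS = {
    "llm_confidence": 0.30,
    "source_quality": 0.20,
    "tests_present": 0.15,
    "examples_present": 0.10,
    "implementation_found": 0.15,
    "multi_source_agreement": 0.10,
}


def confidence_score(signals: dict) -> float:
    """Weighted sum of signals in [0, 1]; missing signals count as 0."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = float(signals.get(name, 0.0))  # booleans become 0.0 or 1.0
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"signal {name!r} must be between 0 and 1, got {value}")
        total += weight * value
    return round(total, 4)
```

With these weights, a candidate supported only by a confident model reply scores at most 0.30, so evidence from tests and implementation is needed to rank highly.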

---

### Step 6 — Candidate graph generation

Output:

```text
Ability
  → Capability
    → Feature
      → Evidence
```

---

### Step 7 — Review

Only approved entries become canonical.

---

## 6. Important design decision

Separate:

```text
Observed facts
```

from:

```text
Interpreted claims
```

Example:

### Observed fact

```yaml
file: src/routes/classify.py
route: POST /classify
```

### Interpreted claim

```yaml
capability: Classify Incoming Email
```

The first is source-derived.
The second is inferred.

This distinction is crucial for trust.
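
In code, the separation can be enforced by giving the two kinds of record different types, with each claim carrying references to the facts that support it. This is a sketch under assumptions: the class names, the `source` and `status` fields, and the `is_grounded` check are illustrative and not part of the design above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ObservedFact:
    """Something a deterministic scanner read directly from the repository."""
    file: str
    route: str
    source: str = "deterministic_scanner"


@dataclass
class InterpretedClaim:
    """Something inferred, which stays a candidate until reviewed."""
    capability: str
    supported_by: list[ObservedFact] = field(default_factory=list)
    source: str = "llm"
    status: str = "candidate"

    def is_grounded(self) -> bool:
        # A claim with no supporting facts deserves extra scrutiny in review.
        return len(self.supported_by) > 0


fact = ObservedFact(file="src/routes/classify.py", route="POST /classify")
claim = InterpretedClaim(capability="Classify Incoming Email", supported_by=[fact])
```

A reviewer can then see, for every claim, exactly which observed facts it rests on.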

---

## 7. MVP technology suggestion

A very pragmatic stack:

```text
Backend: Python FastAPI
DB: PostgreSQL + pgvector
Worker: Celery/RQ or simple background jobs
Git analysis: GitPython / subprocess git
Frontend: React / Next.js or simple server-rendered app
LLM extraction: provider-abstracted interface
```

Given your broader agent/tooling context, Python is probably best for the analyzer.

---

## 8. Web UI structure

### Page 1 — Repository List

Shows:

```text
name
description
status
last analyzed
top abilities
```

### Page 2 — Register Repository

Input:

```text
Git URL
branch
access token (optional)
```

### Page 3 — Analysis Run

Shows:

```text
scan progress
detected structure
candidate entries
warnings
```

### Page 4 — Review

Tree view:

```text
Ability
  Capability
    Feature
      Evidence
```

Actions:

```text
approve
edit
reject
merge
relink
```

### Page 5 — Repository Profile

Final inspectable view.

### Page 6 — Search

Natural-language search with filters:

```text
domain
language
framework
capability type
maturity
evidence strength
```

---

## 9. Internal API boundaries

Keep clean module boundaries:

```text
repo_ingestion
repo_scanning
content_indexing
llm_extraction
candidate_graph
review_workflow
registry_query
web_api
```

This prevents the analyzer from becoming a ball of mud.

---

## 10. What to avoid in v0.1

Do not build yet:

```text
continuous GitHub app integration
full static code analysis
full ontology engine
automatic truth claims
complex permission system
benchmark execution
marketplace functionality
```

MVP should prove:

> Can we register repos, extract useful maps, review them, and search them?

---

## 11. Recommended first implementation path

### Milestone 1 — Manual Registry

Create schema + UI where entries can be entered manually.

### Milestone 2 — Deterministic Scanner

Add repo clone + README/docs/tests/interface detection.

### Milestone 3 — LLM Candidate Extraction

Generate candidate ability/capability/feature graph.

### Milestone 4 — Review Workflow

Approve/edit/reject extracted entries.

### Milestone 5 — Search

Add semantic search over approved registry entries.

---

## 12. Architecture principle

> Deterministic scanners establish facts.
> LLMs propose interpretations.
> Humans or trusted agents approve registry truth.

That should be the backbone.