Repository Scoping — Architecture v0.1
1. Core architectural idea
Use a pipeline + registry + inspection UI architecture.
Git Repo
↓
Ingestion
↓
Analysis Pipeline
↓
Candidate Registry Entries
↓
Human Review / Approval
↓
Searchable Ability Registry
↓
Web UI / API / CLI
The system should not pretend the first analysis is truth. It produces reviewable candidates.
2. Main components
1. Registry Web App
Purpose:
- register repositories
- trigger analysis
- review results
- inspect ability maps
- search repos
This could be a conventional web app with:
Frontend + Backend API + Database
2. Git Ingestion Service
Responsibilities:
- clone/pull repositories
- checkout commit
- store snapshot metadata
- detect repo structure
Outputs:
repo_snapshot:
repo_id
commit_hash
branch
file_tree
metadata_files
3. Repository Analyzer
This is the heart of the system.
Pipeline stages:
Structure Scanner
Documentation Scanner
Interface Scanner
Test Scanner
LLM-Assisted Extractor
Evidence Linker
Confidence Scorer
Important: split deterministic scanners from LLM extraction.
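One way to keep that split visible in code is to register the two kinds of stages in separate lists and let them write to separate fields. The following is a minimal sketch, not a prescribed design: `AnalysisContext`, the stage functions, and the stubbed LLM stage are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AnalysisContext:
    """Shared state passed through the pipeline (hypothetical shape)."""
    file_tree: list[str]
    facts: dict[str, list[str]] = field(default_factory=dict)       # written by scanners only
    candidates: list[dict[str, str]] = field(default_factory=list)  # written by LLM stages only


Stage = Callable[[AnalysisContext], None]


def scan_structure(ctx: AnalysisContext) -> None:
    # Deterministic: top-level directories and files.
    ctx.facts["top_level"] = sorted({path.split("/")[0] for path in ctx.file_tree})


def scan_tests(ctx: AnalysisContext) -> None:
    # Deterministic: anything under tests/.
    ctx.facts["tests"] = sorted(p for p in ctx.file_tree if p.startswith("tests/"))


def extract_with_llm_stub(ctx: AnalysisContext) -> None:
    # Stand-in for the LLM-assisted extractor: it reads facts and only
    # appends candidates, never touching ctx.facts.
    for path in ctx.facts.get("tests", []):
        ctx.candidates.append({"claim": f"behaviour exercised by {path}", "source": path})


DETERMINISTIC_STAGES: list[Stage] = [scan_structure, scan_tests]
LLM_STAGES: list[Stage] = [extract_with_llm_stub]


def run_pipeline(ctx: AnalysisContext) -> AnalysisContext:
    # Deterministic stages always run first so LLM stages see complete facts.
    for stage in DETERMINISTIC_STAGES + LLM_STAGES:
        stage(ctx)
    return ctx
```

Keeping the lists separate makes it easy to rerun only the deterministic half, or to swap the LLM stage without touching the scanners.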
4. Candidate Registry Store
Stores unapproved results:
candidate abilities
candidate capabilities
candidate features
candidate evidence
source references
confidence scores
These are editable.
5. Curator Review Layer
Allows a human or agent to:
- accept
- reject
- rename
- merge
- relink
- approve
This turns candidates into official registry entries.
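The review actions can be thought of as transitions on a small status field. The sketch below covers only rename, approve, and reject; the `Status` values and the rule that only candidates may be changed are assumptions made for illustration, not something this document fixes.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    CANDIDATE = "candidate"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class Entry:
    name: str
    status: Status = Status.CANDIDATE


def _require_candidate(entry: Entry) -> None:
    # Assumed rule: once reviewed, an entry is no longer editable here.
    if entry.status is not Status.CANDIDATE:
        raise ValueError(f"{entry.name!r} is already {entry.status.value}")


def rename(entry: Entry, new_name: str) -> None:
    _require_candidate(entry)
    entry.name = new_name


def approve(entry: Entry) -> None:
    _require_candidate(entry)
    entry.status = Status.APPROVED


def reject(entry: Entry) -> None:
    _require_candidate(entry)
    entry.status = Status.REJECTED


def canonical(entries: list[Entry]) -> list[Entry]:
    # Only approved entries count as official registry entries.
    return [e for e in entries if e.status is Status.APPROVED]
```

Merge and relink would need the graph structure as well, so they are left out of this sketch.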
6. Search / Query Layer
Supports:
natural language search
ability search
capability search
repo search
feature search
evidence search
Use both:
- relational filters
- vector/semantic search
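Combining the two can be as simple as filtering first and ranking the survivors by similarity. The sketch below does this in plain Python on toy data; the entry shape (`name`, `language`, `embedding`) and the function names are hypothetical, and in the suggested stack the ranking would be done by the database rather than in application code.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(entries, query_vec, *, language=None, top_k=5):
    # Relational filter first: cheap, exact, narrows the candidate set.
    filtered = [e for e in entries if language is None or e["language"] == language]
    # Semantic ranking second: order the remaining entries by similarity.
    ranked = sorted(filtered, key=lambda e: cosine(e["embedding"], query_vec), reverse=True)
    return ranked[:top_k]


entries = [
    {"name": "email-classifier", "language": "python", "embedding": [1.0, 0.0]},
    {"name": "pdf-exporter", "language": "python", "embedding": [0.0, 1.0]},
    {"name": "mail-router", "language": "go", "embedding": [1.0, 0.0]},
]
results = hybrid_search(entries, [1.0, 0.1], language="python")
```

Here `results` contains only the two Python entries, with `email-classifier` ranked first.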
7. Public/Agent API
Expose structured access:
GET /repos
GET /repos/{id}
GET /abilities
GET /capabilities
GET /search?q=...
GET /repos/{id}/ability-map
Later this can be exposed in an MCP-friendly form.
3. Suggested storage architecture
Use a hybrid model:
PostgreSQL
For canonical structured data:
repositories
snapshots
abilities
capabilities
features
evidence
links
analysis_runs
review_status
Vector index
For semantic search over:
README chunks
docs chunks
ability descriptions
capability descriptions
feature descriptions
Start simple with pgvector inside PostgreSQL.
Object/file storage
For:
repo snapshots
analysis artifacts
parsed file summaries
exported registry YAML
Local filesystem is fine for MVP.
4. Data model sketch
Repository
has many Snapshots
has many AnalysisRuns
has many RegistryEntries
Snapshot
commit_hash
branch
file_tree
extracted_documents
AnalysisRun
snapshot_id
status
started_at
completed_at
model_used
analyzer_version
Ability
repo_id
name
description
confidence
status
Capability
repo_id
ability_id
name
description
inputs
outputs
confidence
status
Feature
repo_id
capability_id
name
type
location
confidence
status
Evidence
repo_id
capability_id
type
path
strength
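Part of the model above could be written as plain dataclasses before committing to a database schema. This is a sketch only: the field names follow the lists above, while the `id` fields, the Python types, and the default `status` value are assumptions added to make the records linkable.

```python
from dataclasses import dataclass, field


@dataclass
class Ability:
    id: int
    repo_id: int
    name: str
    description: str
    confidence: float
    status: str = "candidate"


@dataclass
class Capability:
    id: int
    repo_id: int
    ability_id: int
    name: str
    description: str
    confidence: float
    inputs: list[str] = field(default_factory=list)
    outputs: list[str] = field(default_factory=list)
    status: str = "candidate"


@dataclass
class Evidence:
    repo_id: int
    capability_id: int
    type: str        # e.g. "test", "doc", "route" (assumed vocabulary)
    path: str
    strength: float


def evidence_for(capability: Capability, evidence: list[Evidence]) -> list[Evidence]:
    # Follow the capability_id link from Evidence back to a Capability.
    return [e for e in evidence if e.capability_id == capability.id]
```

Feature would follow the same pattern as Evidence, keyed by `capability_id`.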
5. Analysis pipeline
Step 1 — Clone / update repo
git clone
checkout commit
record commit hash
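With the subprocess option from the technology list, this step might look roughly like the following. It is a sketch under simple assumptions: a fresh clone into an empty destination, no authentication, and hypothetical function names. Building the command lists separately keeps them easy to inspect and test without network access.

```python
import subprocess


def git_commands(url: str, dest: str, commit: str) -> list[list[str]]:
    # Argument lists (not shell strings) avoid quoting problems.
    return [
        ["git", "clone", url, dest],
        ["git", "-C", dest, "checkout", "--detach", commit],
    ]


def ingest(url: str, dest: str, commit: str) -> str:
    """Clone, check out the requested commit, and return the resolved hash."""
    for cmd in git_commands(url, dest, commit):
        subprocess.run(cmd, check=True)
    result = subprocess.run(
        ["git", "-C", dest, "rev-parse", "HEAD"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

Recording the output of `git rev-parse HEAD` rather than the requested ref means the snapshot always stores a full commit hash, even when a branch name or short hash was passed in.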
Step 2 — Deterministic scan
Detect:
languages
frameworks
package managers
entrypoints
routes
CLI commands
tests
docs
examples
config files
This should be deterministic code, not an LLM call.
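A deterministic scan can start from nothing more than the file tree and a few lookup tables. The sketch below detects languages, package managers, tests, and docs only; the lookup tables are deliberately tiny, and their contents and the `scan` function are illustrative assumptions rather than a complete rule set.

```python
from pathlib import PurePosixPath

LANGUAGE_BY_SUFFIX = {".py": "Python", ".ts": "TypeScript", ".go": "Go", ".rs": "Rust"}
PACKAGE_MANAGER_BY_FILE = {
    "requirements.txt": "pip",
    "package.json": "npm",
    "go.mod": "go modules",
    "Cargo.toml": "cargo",
}


def scan(file_tree: list[str]) -> dict[str, list[str]]:
    languages, managers, tests, docs = set(), set(), [], []
    for raw in file_tree:
        path = PurePosixPath(raw)
        if path.suffix in LANGUAGE_BY_SUFFIX:
            languages.add(LANGUAGE_BY_SUFFIX[path.suffix])
        if path.name in PACKAGE_MANAGER_BY_FILE:
            managers.add(PACKAGE_MANAGER_BY_FILE[path.name])
        if "tests" in path.parts or path.name.startswith("test_"):
            tests.append(raw)
        if path.suffix == ".md" or path.parts[0] == "docs":
            docs.append(raw)
    return {
        "languages": sorted(languages),
        "package_managers": sorted(managers),
        "tests": sorted(tests),
        "docs": sorted(docs),
    }
```

The same input always gives the same output, which is what lets these results be treated as observed facts later on.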
Step 3 — Content chunking
Create chunks from:
README
docs
examples
tests
API specs
selected source files
Each chunk keeps:
file path
line range
content type
semantic role
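A simple line-based chunker is enough to carry those four pieces of metadata along with the text. This is a minimal sketch; the `Chunk` type, the fixed `max_lines` window, and the example role labels are assumptions, and a real chunker would likely split on headings or function boundaries instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    file_path: str
    start_line: int      # 1-based, inclusive
    end_line: int        # 1-based, inclusive
    content_type: str    # e.g. "readme", "test", "source"
    semantic_role: str   # e.g. "usage example", "api description"
    text: str


def chunk_lines(file_path: str, text: str, content_type: str,
                semantic_role: str, max_lines: int = 40) -> list[Chunk]:
    lines = text.splitlines()
    chunks = []
    for start in range(0, len(lines), max_lines):
        block = lines[start:start + max_lines]
        chunks.append(Chunk(
            file_path=file_path,
            start_line=start + 1,
            end_line=start + len(block),
            content_type=content_type,
            semantic_role=semantic_role,
            text="\n".join(block),
        ))
    return chunks
```

Keeping the line range on every chunk is what later allows a piece of evidence to point back to an exact location in the repository.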
Step 4 — LLM-assisted extraction
Ask the model separately for:
candidate abilities
candidate capabilities
candidate features
evidence mappings
Do not use one giant prompt to do everything.
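One focused prompt per extraction target, behind a provider-abstracted interface, could be sketched as follows. The `LLMProvider` protocol, the prompt wording, and the `StubProvider` used for local testing are all hypothetical; no particular vendor API is implied.

```python
from typing import Protocol


class LLMProvider(Protocol):
    def complete(self, prompt: str) -> str: ...


# One small template per target instead of a single catch-all prompt.
PROMPTS = {
    "abilities": "List the high-level abilities this repository offers.\n\n{context}",
    "capabilities": "List the concrete capabilities you can identify.\n\n{context}",
    "features": "List the implementation features you can identify.\n\n{context}",
    "evidence": "Map each claim to the file paths that support it.\n\n{context}",
}


def extract_all(provider: LLMProvider, context: str) -> dict[str, str]:
    # Each target gets its own call, so one failure or bad answer
    # can be retried or discarded without redoing the others.
    return {
        kind: provider.complete(template.format(context=context))
        for kind, template in PROMPTS.items()
    }


class StubProvider:
    """Records prompts instead of calling a model; for tests only."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def complete(self, prompt: str) -> str:
        self.calls.append(prompt)
        return f"stub response {len(self.calls)}"
```

Any real provider only has to implement `complete`, which keeps the extraction code independent of the model in use.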
Step 5 — Confidence scoring
Combine:
LLM confidence
source quality
tests present
examples present
implementation found
multiple-source agreement
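A weighted sum is one straightforward way to combine these signals. The weights below are placeholders chosen only so that they add up to 1.0; they are assumptions for illustration and would need tuning against reviewed data.

```python
# Hypothetical weights; each signal is expected in the range 0.0 to 1.0.
WEIGHTS = {
    "llm_confidence": 0.30,
    "source_quality": 0.20,
    "tests_present": 0.15,
    "examples_present": 0.10,
    "implementation_found": 0.15,
    "multi_source_agreement": 0.10,
}


def confidence_score(signals: dict[str, float]) -> float:
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = float(signals.get(name, 0.0))      # missing signal counts as 0
        total += weight * min(max(value, 0.0), 1.0)  # clamp to [0, 1]
    return round(total, 4)
```

Because the LLM's own confidence is only one input among several, a claim with no tests, examples, or implementation behind it cannot reach a high score under these weights.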
Step 6 — Candidate graph generation
Output:
Ability
→ Capability
→ Feature
→ Evidence
Step 7 — Review
Only approved entries become canonical.
6. Important design decision
Separate:
Observed facts
from:
Interpreted claims
Example:
Observed fact
file: src/routes/classify.py
route: POST /classify
Interpreted claim
capability: Classify Incoming Email
The first is source-derived. The second is inferred.
This distinction is crucial for trust.
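In code, the distinction can be enforced by giving the two kinds of statement different types, with claims pointing back to the facts that support them. The sketch below reuses the example above; the class and field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ObservedFact:
    """Derived directly from the source tree by a deterministic scanner."""
    file: str
    detail: str


@dataclass
class InterpretedClaim:
    """Inferred by an LLM or a human; needs review before becoming canonical."""
    kind: str
    name: str
    supported_by: list[ObservedFact] = field(default_factory=list)

    def is_grounded(self) -> bool:
        # A claim with no supporting facts should be treated with suspicion.
        return bool(self.supported_by)


fact = ObservedFact(file="src/routes/classify.py", detail="route: POST /classify")
claim = InterpretedClaim(kind="capability", name="Classify Incoming Email",
                         supported_by=[fact])
```

Making `ObservedFact` immutable reflects that facts are tied to a snapshot, while claims stay editable during review.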
7. MVP technology suggestion
A very pragmatic stack:
Backend: Python FastAPI
DB: PostgreSQL + pgvector
Worker: Celery/RQ or simple background jobs
Git analysis: GitPython / subprocess git
Frontend: React / Next.js or simple server-rendered app
LLM extraction: provider-abstracted interface
Given your broader agent/tooling context, Python is probably best for the analyzer.
8. Web UI structure
Page 1 — Repository List
Shows:
name
description
status
last analyzed
top abilities
Page 2 — Register Repository
Input:
Git URL
branch
access token (optional)
Page 3 — Analysis Run
Shows:
scan progress
detected structure
candidate entries
warnings
Page 4 — Review
Tree view:
Ability
Capability
Feature
Evidence
Actions:
approve
edit
reject
merge
relink
Page 5 — Repository Profile
Final inspectable view.
Page 6 — Search
Natural-language search with filters:
domain
language
framework
capability type
maturity
evidence strength
9. Internal API boundaries
Keep clean module boundaries:
repo_ingestion
repo_scanning
content_indexing
llm_extraction
candidate_graph
review_workflow
registry_query
web_api
This prevents the analyzer from becoming a ball of mud.
10. What to avoid in v0.1
Do not build yet:
continuous GitHub app integration
full static code analysis
full ontology engine
automatic truth claims
complex permission system
benchmark execution
marketplace functionality
MVP should prove:
Can we register repos, extract useful maps, review them, and search them?
11. Recommended first implementation path
Milestone 1 — Manual Registry
Create schema + UI where entries can be entered manually.
Milestone 2 — Deterministic Scanner
Add repo clone + README/docs/tests/interface detection.
Milestone 3 — LLM Candidate Extraction
Generate candidate ability/capability/feature graph.
Milestone 4 — Review Workflow
Approve/edit/reject extracted entries.
Milestone 5 — Search
Add semantic search over approved registry entries.
12. Architecture principle
Deterministic scanners establish facts. LLMs propose interpretations. Humans or trusted agents approve registry truth.
That should be the backbone.