# Repository Scoping — Architecture v0.1

## 1. Core architectural idea

Use a **pipeline + registry + inspection UI** architecture.

```text
Git Repo
   ↓
Ingestion
   ↓
Analysis Pipeline
   ↓
Candidate Registry Entries
   ↓
Human Review / Approval
   ↓
Searchable Ability Registry
   ↓
Web UI / API / CLI
```

The system should not pretend the first analysis is truth. It produces **reviewable candidates**.

---

## 2. Main components

### 1. Registry Web App

Purpose:

* register repositories
* trigger analysis
* review results
* inspect ability maps
* search repos

Could be a normal web app with:

```text
Frontend + Backend API + Database
```

---

### 2. Git Ingestion Service

Responsibilities:

* clone/pull repositories
* checkout commit
* store snapshot metadata
* detect repo structure

Outputs:

```yaml
repo_snapshot:
  repo_id
  commit_hash
  branch
  file_tree
  metadata_files
```

---

### 3. Repository Analyzer

This is the heart.

Pipeline stages:

```text
Structure Scanner
Documentation Scanner
Interface Scanner
Test Scanner
LLM-Assisted Extractor
Evidence Linker
Confidence Scorer
```

Important: split deterministic scanners from LLM extraction.

---

### 4. Candidate Registry Store

Stores unapproved results:

```text
candidate abilities
candidate capabilities
candidate features
candidate evidence
source references
confidence scores
```

These are editable.

---

### 5. Curator Review Layer

Allows a human or agent to:

* accept
* reject
* rename
* merge
* relink
* approve

This turns candidates into official registry entries.

---

### 6. Search / Query Layer

Supports:

```text
natural language search
ability search
capability search
repo search
feature search
evidence search
```

Use both:

* relational filters
* vector/semantic search

---

### 7. Public/Agent API

Expose structured access:

```http
GET /repos
GET /repos/{id}
GET /abilities
GET /capabilities
GET /search?q=...
GET /repos/{id}/ability-map
```

Later this becomes MCP-friendly.

---

## 3. Suggested storage architecture

Use a hybrid model:

### PostgreSQL

For canonical structured data:

```text
repositories
snapshots
abilities
capabilities
features
evidence
links
analysis_runs
review_status
```

### Vector index

For semantic search over:

```text
README chunks
docs chunks
ability descriptions
capability descriptions
feature descriptions
```

Start simple with `pgvector` inside PostgreSQL.

### Object/file storage

For:

```text
repo snapshots
analysis artifacts
parsed file summaries
exported registry YAML
```

Local filesystem is fine for MVP.

---

## 4. Data model sketch

```text
Repository
  has many Snapshots
  has many AnalysisRuns
  has many RegistryEntries

Snapshot
  commit_hash
  branch
  file_tree
  extracted_documents

AnalysisRun
  snapshot_id
  status
  started_at
  completed_at
  model_used
  analyzer_version

Ability
  repo_id
  name
  description
  confidence
  status

Capability
  repo_id
  ability_id
  name
  description
  inputs
  outputs
  confidence
  status

Feature
  repo_id
  capability_id
  name
  type
  location
  confidence
  status

Evidence
  repo_id
  capability_id
  type
  path
  strength
```

---

## 5. Analysis pipeline

### Step 1 — Clone / update repo

```text
git clone
checkout commit
record commit hash
```

---

### Step 2 — Deterministic scan

Detect:

```text
languages
frameworks
package managers
entrypoints
routes
CLI commands
tests
docs
examples
config files
```

This should be deterministic code, not LLM.

---

### Step 3 — Content chunking

Create chunks from:

```text
README
docs
examples
tests
API specs
selected source files
```

Each chunk keeps:

```text
file path
line range
content type
semantic role
```

---

### Step 4 — LLM-assisted extraction

Ask the model separately for:

```text
candidate abilities
candidate capabilities
candidate features
evidence mappings
```

Do not ask one giant prompt to do everything.
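One way to read Step 4 is as a loop of narrow extraction passes over the same chunks, each with its own prompt. The sketch below is only an illustration under stated assumptions: `LLMClient` is a hypothetical protocol standing in for the provider-abstracted interface, and `Chunk`, `EXTRACTION_PASSES`, `build_prompt`, `run_extraction`, and the prompt wording are invented here, not part of the design.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Chunk:
    """A content chunk carrying the metadata listed in Step 3."""
    file_path: str
    line_range: tuple[int, int]
    content_type: str
    semantic_role: str
    text: str


class LLMClient(Protocol):
    """Hypothetical provider-abstracted interface; any backend can implement it."""
    def complete(self, prompt: str) -> str: ...


# One narrow instruction per extraction target (illustrative wording only).
EXTRACTION_PASSES = {
    "abilities": "List the high-level abilities this repository offers.",
    "capabilities": "List the concrete capabilities that support each ability.",
    "features": "List the implemented features behind each capability.",
    "evidence": "Map each claim to the file paths and line ranges that support it.",
}


def build_prompt(instruction: str, chunks: list[Chunk]) -> str:
    """Prefix every chunk with its source location so answers can cite it."""
    parts = [instruction, ""]
    for c in chunks:
        start, end = c.line_range
        parts.append(f"[{c.file_path}:{start}-{end}] ({c.content_type}, {c.semantic_role})")
        parts.append(c.text)
    return "\n".join(parts)


def run_extraction(client: LLMClient, chunks: list[Chunk]) -> dict[str, str]:
    """Issue one request per target instead of a single giant prompt."""
    return {
        target: client.complete(build_prompt(instruction, chunks))
        for target, instruction in EXTRACTION_PASSES.items()
    }


class EchoClient:
    """Stand-in client for local testing: returns the first line of the prompt."""
    def complete(self, prompt: str) -> str:
        return prompt.splitlines()[0]


if __name__ == "__main__":
    demo = [Chunk("README.md", (1, 20), "markdown", "overview", "Classifies incoming email.")]
    for target, raw in run_extraction(EchoClient(), demo).items():
        print(target, "->", raw)
```

Keeping each pass separate means a weak result for one target (say, evidence mappings) can be retried or re-prompted without re-running the others, and the raw output of every pass can be stored as its own analysis artifact.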
---

### Step 5 — Confidence scoring

Combine:

```text
LLM confidence
source quality
tests present
examples present
implementation found
multiple-source agreement
```

---

### Step 6 — Candidate graph generation

Output:

```text
Ability → Capability → Feature → Evidence
```

---

### Step 7 — Review

Only approved entries become canonical.

---

## 6. Important design decision

Separate:

```text
Observed facts
```

from:

```text
Interpreted claims
```

Example:

### Observed fact

```yaml
file: src/routes/classify.py
route: POST /classify
```

### Interpreted claim

```yaml
capability: Classify Incoming Email
```

The first is source-derived. The second is inferred.

This distinction is crucial for trust.

---

## 7. MVP technology suggestion

A very pragmatic stack:

```text
Backend: Python FastAPI
DB: PostgreSQL + pgvector
Worker: Celery/RQ or simple background jobs
Git analysis: GitPython / subprocess git
Frontend: React / Next.js or simple server-rendered app
LLM extraction: provider-abstracted interface
```

Given your broader agent/tooling context, Python is probably best for the analyzer.

---

## 8. Web UI structure

### Page 1 — Repository List

Shows:

```text
name
description
status
last analyzed
top abilities
```

### Page 2 — Register Repository

Input:

```text
Git URL
branch
access token optional
```

### Page 3 — Analysis Run

Shows:

```text
scan progress
detected structure
candidate entries
warnings
```

### Page 4 — Review

Tree view:

```text
Ability
  Capability
    Feature
      Evidence
```

Actions:

```text
approve
edit
reject
merge
relink
```

### Page 5 — Repository Profile

Final inspectable view.

### Page 6 — Search

Natural-language search with filters:

```text
domain
language
framework
capability type
maturity
evidence strength
```

---

## 9. Internal API boundaries

Keep clean module boundaries:

```text
repo_ingestion
repo_scanning
content_indexing
llm_extraction
candidate_graph
review_workflow
registry_query
web_api
```

This prevents the analyzer from becoming a ball of mud.

---

## 10. What to avoid in v0.1

Do not build yet:

```text
continuous GitHub app integration
full static code analysis
full ontology engine
automatic truth claims
complex permission system
benchmark execution
marketplace functionality
```

MVP should prove:

> Can we register repos, extract useful maps, review them, and search them?

---

## 11. Recommended first implementation path

### Milestone 1 — Manual Registry

Create schema + UI where entries can be entered manually.

### Milestone 2 — Deterministic Scanner

Add repo clone + README/docs/tests/interface detection.

### Milestone 3 — LLM Candidate Extraction

Generate candidate ability/capability/feature graph.

### Milestone 4 — Review Workflow

Approve/edit/reject extracted entries.

### Milestone 5 — Search

Add semantic search over approved registry entries.

---

## 12. Architecture principle

> Deterministic scanners establish facts.
> LLMs propose interpretations.
> Humans or trusted agents approve registry truth.

That should be the backbone.
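The principle can be encoded directly in the types, so that nothing reaches the canonical registry without an explicit review step. This is a minimal sketch, not a prescribed implementation: `ObservedFact`, `CandidateClaim`, `ReviewStatus`, `approve`, and `canonical_entries` are hypothetical names, and the rule that approval requires at least one observed fact is an assumption of this sketch. The example values are the ones from section 6.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Optional


class ReviewStatus(Enum):
    CANDIDATE = "candidate"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass(frozen=True)
class ObservedFact:
    """Source-derived data produced by a deterministic scanner."""
    file: str
    route: str


@dataclass(frozen=True)
class CandidateClaim:
    """An interpretation proposed by the LLM, linked to the facts behind it."""
    capability: str
    evidence: tuple[ObservedFact, ...]
    status: ReviewStatus = ReviewStatus.CANDIDATE
    reviewed_by: Optional[str] = None


def approve(claim: CandidateClaim, reviewer: str) -> CandidateClaim:
    """Promote a claim; only a named human or trusted agent can do this."""
    # Assumption of this sketch: a claim with no observed evidence is not approvable.
    if not claim.evidence:
        raise ValueError("cannot approve a claim with no observed evidence")
    return replace(claim, status=ReviewStatus.APPROVED, reviewed_by=reviewer)


def canonical_entries(claims: list[CandidateClaim]) -> list[CandidateClaim]:
    """The searchable registry exposes approved entries only."""
    return [c for c in claims if c.status is ReviewStatus.APPROVED]


if __name__ == "__main__":
    fact = ObservedFact(file="src/routes/classify.py", route="POST /classify")
    claim = CandidateClaim(capability="Classify Incoming Email", evidence=(fact,))
    print(len(canonical_entries([claim])))
    print(len(canonical_entries([approve(claim, "curator")])))
```

Because both record types are frozen, an approval produces a new record rather than mutating the candidate, which keeps the original extraction output available for audit.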