# Repository Ability Registry Implementation Workplan

## 1. Documentation Review Summary

The wiki defines a coherent v0.1 product: a registry that turns Git repositories into reviewable, source-linked maps of:

```text
Ability -> Capability -> Feature -> Evidence -> Code location
```

The strongest architectural principle across the docs is:

```text
deterministic scanners establish observed facts
LLM-assisted extractors propose interpreted claims
humans or trusted agents approve registry truth
```

This should remain the core design constraint for implementation. The system should be conservative, explainable, reviewable, and source-linked rather than attempting fully automatic code understanding.

## 2. MVP Scope

The first version should implement the core journey documented in the PRD, FRS, architecture sketch, and use-case catalog:

```text
Register repository
Analyze repository
Generate candidate ability/capability/feature/evidence map
Review and approve candidates
Publish registry profile
Search and inspect repositories
```

In scope for v0.1:

- Repository registration by Git URL
- Repository metadata and snapshot tracking
- Deterministic repository scan
- Candidate extraction for abilities, capabilities, features, and evidence
- Human review actions: edit, approve, reject, merge, relink
- Inspectable ability map
- Natural-language search over approved registry entries
- API access for repositories, ability maps, capabilities, and search

Out of scope for v0.1:

- Continuous GitHub app integration
- Full static code understanding
- Advanced ontology enforcement
- Distributed indexing
- Benchmark execution
- Marketplace features
- Complex access control
- Automated truth claims without review

## 3. Recommended Technical Baseline

Use a pragmatic stack that keeps the analyzer and registry easy to evolve:

- Backend: Python FastAPI
- Database: PostgreSQL
- Semantic search: pgvector inside PostgreSQL
- Worker: simple background jobs first; graduate to RQ or Celery when needed
- Git access: subprocess git or GitPython
- Frontend: React/Next.js or server-rendered FastAPI templates for earliest prototype
- LLM extraction: provider-abstracted interface
- Local artifact storage: filesystem under an application data directory

For the first implementation pass, prefer a modular monolith over distributed services. Keep clean module boundaries internally, but avoid operational complexity until the product loop is proven.

## 4. Core Domain Model

Implement these entities first:

- Repository
- RepositorySnapshot
- AnalysisRun
- ObservedFact
- CandidateAbility
- CandidateCapability
- CandidateFeature
- CandidateEvidence
- ApprovedAbility
- ApprovedCapability
- ApprovedFeature
- ApprovedEvidence
- SourceReference
- ReviewDecision

The model should preserve a clear distinction between observed facts and interpreted claims.

Observed facts include things like:

- File paths
- Documentation files
- Test files
- Package manifests
- API routes
- CLI commands
- Public modules/functions
- Detected languages/frameworks

Interpreted claims include:

- Ability names and descriptions
- Capability names and descriptions
- Feature-to-capability links
- Evidence-to-capability links
- Confidence scores

## 5. Suggested Module Boundaries

Use the architecture sketch's boundaries as implementation modules:

- `repo_ingestion`: validate Git URLs, clone/fetch repos, resolve branch/commit
- `repo_scanning`: deterministic file tree, language, docs, tests, examples, API/CLI detection
- `content_indexing`: text extraction, chunking, source references, embeddings
- `llm_extraction`: prompt orchestration and structured candidate generation
- `candidate_graph`: build and validate ability/capability/feature/evidence relationships
- `review_workflow`: edit, approve, reject, merge, relink, publish
- `registry_query`: search, filters, profile retrieval, ability-map assembly
- `web_api`: HTTP endpoints and request/response schemas
- `web_ui`: registration, analysis, review, profile, and search screens

## 6. Milestones

### Milestone 0: Project Foundation

Goal: establish the application skeleton and development path.

Deliverables:

- Backend app skeleton
- Database migration setup
- Configuration system
- Local development instructions
- Basic test harness
- Health endpoint

Acceptance criteria:

- App starts locally
- Tests run locally
- Database migrations apply cleanly

### Milestone 1: Manual Registry

Goal: prove the core data model and inspection experience before automation.

Deliverables:

- Repository CRUD
- Manual ability/capability/feature/evidence CRUD
- Ability map endpoint
- Basic repository profile UI

Acceptance criteria:

- A user can create a repository profile by hand
- The UI displays `Ability -> Capability -> Feature -> Evidence`
- API returns the same map as structured JSON

### Milestone 2: Git Ingestion and Deterministic Scanner

Goal: establish trustworthy observed facts from repository contents.
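
Git URL validation is one of the first pieces this milestone needs, since the URL is later handed to `git`. The following is a minimal sketch, not a definitive implementation: the function name `is_valid_git_url`, the allowed-scheme set, and the scp-like pattern are all assumptions made for illustration and do not come from the wiki.

```python
import re
from urllib.parse import urlparse

# Assumption: only network transports are accepted in v0.1.
ALLOWED_SCHEMES = {"https", "ssh", "git"}

# Assumption: scp-like remotes such as git@host:org/repo.git are allowed.
SCP_LIKE = re.compile(r"^[\w.-]+@[\w.-]+:[\w./~-]+$")


def is_valid_git_url(url: str) -> bool:
    """Return True if the URL looks like a remote Git repository URL."""
    url = url.strip()
    # Reject empty input and anything git could interpret as an option.
    if not url or url.startswith("-"):
        return False
    if SCP_LIKE.match(url):
        return True
    parsed = urlparse(url)
    if parsed.scheme not in ALLOWED_SCHEMES:
        return False
    if not parsed.hostname:
        return False
    # Require a non-empty repository path after the host.
    return bool(parsed.path.strip("/"))
```

Rejecting values that start with `-` and schemes outside a short allow-list is a conservative choice: it keeps local paths and transport helpers out of the clone step until there is a reason to support them.
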

Deliverables:

- Git URL validation
- Clone/fetch and checkout
- Snapshot record with branch and commit hash
- File tree scan
- README/docs/examples/tests/package manifest detection
- Basic language/framework/interface detection
- Analysis run status tracking

Acceptance criteria:

- A public Git repository can be registered and analyzed
- The system records a snapshot and deterministic scan summary
- Analysis failures are visible without corrupting prior data

### Milestone 3: Reviewable Candidate Graph

Goal: generate candidate registry entries from deterministic facts and extracted content.

Deliverables:

- Content extraction from README, docs, examples, tests, package metadata, and selected source files
- Source references with file paths and line ranges where possible
- Candidate ability generation
- Candidate capability generation
- Candidate feature generation
- Candidate evidence detection
- Confidence scoring using the documented additive factors
- Candidate graph endpoint and UI

Acceptance criteria:

- Analysis produces candidates with source references and confidence
- Candidates distinguish observed facts from interpreted claims
- Candidate output is explainable enough for curator review

### Milestone 4: Review and Approval Workflow

Goal: turn candidates into canonical registry entries.

Deliverables:

- Approve/reject candidate entries
- Edit names, descriptions, confidence, and relationships
- Merge duplicate abilities/capabilities/features
- Relink capabilities, features, and evidence
- Publish approved repository profile
- Persist review decisions

Acceptance criteria:

- A curator can correct and approve an analysis result
- Only approved entries appear in canonical search/profile views
- Repository status changes from analyzed to indexed/published

### Milestone 5: Search and Inspection

Goal: make the registry useful for discovery.

Deliverables:

- Text search over repositories, abilities, capabilities, and descriptions
- Semantic search with pgvector
- Search filters for language, framework, and ability/capability presence
- Search UI
- Repository profile drill-down UI
- Code/evidence links from features and capabilities

Acceptance criteria:

- A user can search by need using natural language
- Results show repository, matching ability/capability, confidence, and evidence level
- A user can drill from a search result into the ability map and code/evidence references

### Milestone 6: API Completeness for Agents

Goal: support programmatic consumers cleanly.

Deliverables:

- `GET /repos`
- `POST /repos`
- `GET /repos/{id}`
- `POST /repos/{id}/analysis-runs`
- `GET /repos/{id}/analysis-runs/{run_id}`
- `GET /repos/{id}/ability-map`
- `GET /abilities`
- `GET /capabilities`
- `GET /search?q=...`
- OpenAPI examples

Acceptance criteria:

- API covers repository registration, analysis, search, and inspection
- Responses are stable enough for agent/tooling integration
- OpenAPI docs describe all MVP endpoints

## 7. Initial Database Shape

Start with tables for:

- `repositories`
- `repository_snapshots`
- `analysis_runs`
- `observed_facts`
- `source_references`
- `candidate_abilities`
- `candidate_capabilities`
- `candidate_features`
- `candidate_evidence`
- `candidate_links`
- `approved_abilities`
- `approved_capabilities`
- `approved_features`
- `approved_evidence`
- `approved_links`
- `review_decisions`
- `content_chunks`
- `embeddings`

Use status fields consistently:

```text
registered
ingesting
analyzing
analysis_failed
analyzed
reviewing
indexed
```

## 8. Analyzer v0.1 Strategy

The first analyzer should be intentionally modest.
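
To give a sense of how modest the deterministic pass can be, the sketch below classifies a list of repository-relative paths into docs, tests, examples, and manifests, and counts languages by file extension. It is an illustration under stated assumptions: the names `ScanSummary` and `scan_paths`, the lookup tables, and the choice to work on path strings (for example, the output of `git ls-files`) are hypothetical and not taken from the wiki.

```python
from collections import Counter
from dataclasses import dataclass, field
from pathlib import PurePosixPath

# Hypothetical lookup tables; a real scanner would cover many more cases.
LANGUAGE_BY_EXTENSION = {
    ".py": "Python", ".js": "JavaScript", ".ts": "TypeScript",
    ".go": "Go", ".rs": "Rust",
}
MANIFEST_NAMES = {
    "pyproject.toml", "setup.py", "requirements.txt",
    "package.json", "go.mod", "Cargo.toml",
}
DOC_DIRS = {"docs", "doc"}
TEST_DIRS = {"tests", "test"}
EXAMPLE_DIRS = {"examples", "example"}


@dataclass
class ScanSummary:
    docs: list[str] = field(default_factory=list)
    tests: list[str] = field(default_factory=list)
    examples: list[str] = field(default_factory=list)
    manifests: list[str] = field(default_factory=list)
    languages: Counter = field(default_factory=Counter)


def scan_paths(paths: list[str]) -> ScanSummary:
    """Classify repository-relative paths using only path-based rules."""
    summary = ScanSummary()
    for raw in paths:
        path = PurePosixPath(raw)
        name = path.name
        parents = {part.lower() for part in path.parts[:-1]}
        if name in MANIFEST_NAMES:
            summary.manifests.append(raw)
        elif name.lower().startswith("readme") or parents & DOC_DIRS:
            summary.docs.append(raw)
        elif parents & TEST_DIRS or name.startswith("test_"):
            summary.tests.append(raw)
        elif parents & EXAMPLE_DIRS:
            summary.examples.append(raw)
        # Language detection is independent of the category above.
        language = LANGUAGE_BY_EXTENSION.get(path.suffix.lower())
        if language:
            summary.languages[language] += 1
    return summary
```

Because every rule here depends only on the path, the output is reproducible for a given snapshot, which is what lets it be stored as observed fact rather than interpreted claim.
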

Deterministic scan:

- Identify repo root metadata files
- Identify docs, examples, tests, package manifests, API specs, config files
- Detect languages from extensions and package files
- Detect common frameworks from manifests
- Detect likely API/CLI features using simple framework-specific scanners

Content extraction:

- README and docs first
- Examples and tests second
- Selected source files only when they expose interfaces
- Preserve path and line references

LLM extraction:

- Use separate prompts for abilities, capabilities, features, and evidence
- Request structured JSON
- Require source references for each candidate
- Reject or mark speculative any candidate without supporting sources

Confidence scoring:

- Start from the documented additive model
- Normalize to `0.0-1.0`
- Store both numeric confidence and label

## 9. UI Workplan

Build application screens in this order:

1. Repository list
2. Repository registration
3. Repository detail and analysis status
4. Deterministic scan summary
5. Candidate review tree
6. Published repository profile
7. Search

The UI should feel like an operational tool rather than a marketing site: dense, clear, review-focused, and optimized for repeated curator work.

## 10. Testing Strategy

Add tests around the highest-risk boundaries:

- Database migrations and model relationships
- Git URL validation
- Scanner output for fixture repositories
- Candidate graph validation
- Review workflow transitions
- Search result ranking and filtering
- API contract tests for MVP endpoints

Create small fixture repositories for:

- README-only repository
- Python CLI repository
- FastAPI repository
- JavaScript/TypeScript package
- Repository with tests and examples
- Repository with weak or misleading docs

## 11. Key Risks and Mitigations

Extraction quality risk:

- Require source references.
- Keep candidates reviewable.
- Separate observed facts from interpreted claims.

Over-complex ontology risk:

- Keep v0.1 schema minimal.
- Avoid enforcing deep taxonomy too early.

Search quality risk:

- Combine relational filters, full-text search, and vector search.
- Show why a result matched.

Operational complexity risk:

- Start as a modular monolith.
- Use simple jobs before adding worker infrastructure.

Trust risk:

- Never publish unapproved claims as canonical truth.
- Preserve analysis run history and review decisions.

## 12. Immediate Next Actions

Recommended next implementation sequence:

1. Scaffold the FastAPI application, database migrations, and test harness.
2. Implement the core schema for repositories, snapshots, analysis runs, observed facts, candidates, and approved entries.
3. Add manual registry CRUD and ability-map API.
4. Build a minimal repository list/profile UI.
5. Add Git ingestion and deterministic scanning.
6. Add candidate graph generation and review workflow.

The first meaningful demo should be:

```text
Create a repository
Add or generate an ability map
Approve it
Search for a capability
Open the repository profile
Drill down to feature and evidence locations
```
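
The demo flow above can be walked through with a small in-memory sketch before any database or UI exists. Everything here is hypothetical and simplified for illustration: the `Registry`, `Capability`, and `Evidence` names are assumptions, the ability map is collapsed to capability plus evidence, and plain keyword matching stands in for the full-text and vector search described earlier. The one rule it does try to reflect faithfully is that unapproved candidates never appear in search results.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    path: str  # file path inside the repository
    line: int  # line where the supporting code or doc starts


@dataclass
class Capability:
    ability: str
    name: str
    description: str
    evidence: list[Evidence] = field(default_factory=list)
    approved: bool = False  # candidates start unapproved


class Registry:
    """In-memory stand-in for the registry; not a persistence design."""

    def __init__(self) -> None:
        self._repos: dict[str, list[Capability]] = {}

    def register(self, url: str) -> None:
        self._repos.setdefault(url, [])

    def add_candidate(self, url: str, capability: Capability) -> None:
        self._repos[url].append(capability)

    def approve(self, url: str, name: str) -> None:
        for capability in self._repos[url]:
            if capability.name == name:
                capability.approved = True

    def search(self, query: str) -> list[tuple[str, Capability]]:
        """Return (repository URL, capability) pairs for approved matches."""
        terms = query.lower().split()
        hits = []
        for url, capabilities in self._repos.items():
            for capability in capabilities:
                text = f"{capability.name} {capability.description}".lower()
                if capability.approved and any(t in text for t in terms):
                    hits.append((url, capability))
        return hits
```

A run of the demo would register a repository, add a candidate capability with one evidence location, confirm that searching finds nothing, approve the candidate, and then search again to reach the evidence path and line from the hit.
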