Repository Ability Registry Implementation Workplan
1. Documentation Review Summary
The wiki defines a coherent v0.1 product: a registry that turns Git repositories into reviewable, source-linked maps of:
Ability -> Capability -> Feature -> Evidence -> Code location
The strongest architectural principle across the docs is:
deterministic scanners establish observed facts
LLM-assisted extractors propose interpreted claims
humans or trusted agents approve registry truth
This should remain the core design constraint for implementation. The system should be conservative, explainable, reviewable, and source-linked rather than attempting fully automatic code understanding.
2. MVP Scope
The first version should implement the core journey documented in the PRD, FRS, architecture sketch, and use-case catalog:
Register repository
Analyze repository
Generate candidate ability/capability/feature/evidence map
Review and approve candidates
Publish registry profile
Search and inspect repositories
In scope for v0.1:
- Repository registration by Git URL
- Repository metadata and snapshot tracking
- Deterministic repository scan
- Candidate extraction for abilities, capabilities, features, and evidence
- Human review actions: edit, approve, reject, merge, relink
- Inspectable ability map
- Natural-language search over approved registry entries
- API access for repositories, ability maps, capabilities, and search
Out of scope for v0.1:
- Continuous GitHub app integration
- Full static code understanding
- Advanced ontology enforcement
- Distributed indexing
- Benchmark execution
- Marketplace features
- Complex access control
- Automated truth claims without review
3. Recommended Technical Baseline
Use a pragmatic stack that keeps the analyzer and registry easy to evolve:
- Backend: Python FastAPI
- Database: PostgreSQL
- Semantic search: pgvector inside PostgreSQL
- Worker: simple background jobs first; graduate to RQ or Celery when needed
- Git access: subprocess git or GitPython
- Frontend: React/Next.js, or server-rendered FastAPI templates for the earliest prototype
- LLM extraction: provider-abstracted interface
- Local artifact storage: filesystem under an application data directory
For the first implementation pass, prefer a modular monolith over distributed services. Keep clean module boundaries internally, but avoid operational complexity until the product loop is proven.
4. Core Domain Model
Implement these entities first:
- Repository
- RepositorySnapshot
- AnalysisRun
- ObservedFact
- CandidateAbility
- CandidateCapability
- CandidateFeature
- CandidateEvidence
- ApprovedAbility
- ApprovedCapability
- ApprovedFeature
- ApprovedEvidence
- SourceReference
- ReviewDecision
The model should preserve a clear distinction between observed facts and interpreted claims.
Observed facts include things like:
- File paths
- Documentation files
- Test files
- Package manifests
- API routes
- CLI commands
- Public modules/functions
- Detected languages/frameworks
Interpreted claims include:
- Ability names and descriptions
- Capability names and descriptions
- Feature-to-capability links
- Evidence-to-capability links
- Confidence scores
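One way to keep this distinction visible in code is to give observed facts and interpreted claims separate types, so that only claims carry a confidence score. The following is a minimal sketch using Python dataclasses; the field names and the example `kind` values are illustrative assumptions, not taken from the wiki.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class SourceReference:
    """Points at a file, and optionally a line range, in one snapshot."""
    path: str
    start_line: Optional[int] = None
    end_line: Optional[int] = None


@dataclass(frozen=True)
class ObservedFact:
    """Something a deterministic scanner saw; it has no confidence score."""
    kind: str   # hypothetical kinds, e.g. "test_file" or "api_route"
    value: str
    source: SourceReference


@dataclass
class CandidateCapability:
    """An interpreted claim proposed by an extractor, pending review."""
    name: str
    description: str
    confidence: float
    sources: list[SourceReference] = field(default_factory=list)

    def is_supported(self) -> bool:
        # A claim without any source reference cannot be traced to code.
        return len(self.sources) > 0
```

Making `ObservedFact` frozen reflects the idea that scanner output is recorded once per snapshot, while candidates stay editable during review.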
5. Suggested Module Boundaries
Use the architecture sketch's boundaries as implementation modules:
- repo_ingestion: validate Git URLs, clone/fetch repos, resolve branch/commit
- repo_scanning: deterministic file tree, language, docs, tests, examples, API/CLI detection
- content_indexing: text extraction, chunking, source references, embeddings
- llm_extraction: prompt orchestration and structured candidate generation
- candidate_graph: build and validate ability/capability/feature/evidence relationships
- review_workflow: edit, approve, reject, merge, relink, publish
- registry_query: search, filters, profile retrieval, ability-map assembly
- web_api: HTTP endpoints and request/response schemas
- web_ui: registration, analysis, review, profile, and search screens
6. Milestones
Milestone 0: Project Foundation
Goal: establish the application skeleton and development path.
Deliverables:
- Backend app skeleton
- Database migration setup
- Configuration system
- Local development instructions
- Basic test harness
- Health endpoint
Acceptance criteria:
- App starts locally
- Tests run locally
- Database migrations apply cleanly
Milestone 1: Manual Registry
Goal: prove the core data model and inspection experience before automation.
Deliverables:
- Repository CRUD
- Manual ability/capability/feature/evidence CRUD
- Ability map endpoint
- Basic repository profile UI
Acceptance criteria:
- A user can create a repository profile by hand
- The UI displays Ability -> Capability -> Feature -> Evidence
- API returns the same map as structured JSON
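The ability-map endpoint mostly has to nest flat rows into the Ability -> Capability -> Feature -> Evidence shape. The sketch below shows one possible assembly step. It assumes each row is a dict with an `id` and a parent key (`ability_id`, `capability_id`, `feature_id`); those key names are hypothetical, and the strict tree is a simplification, since evidence may also link directly to capabilities.

```python
import json
from collections import defaultdict


def _group(rows, key):
    """Index rows by the value of their parent key."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row)
    return grouped


def build_ability_map(abilities, capabilities, features, evidence):
    """Nest flat rows into Ability -> Capability -> Feature -> Evidence."""
    caps = _group(capabilities, "ability_id")
    feats = _group(features, "capability_id")
    evid = _group(evidence, "feature_id")
    return [
        {
            "id": a["id"],
            "name": a["name"],
            "capabilities": [
                {
                    "id": c["id"],
                    "name": c["name"],
                    "features": [
                        {
                            "id": f["id"],
                            "name": f["name"],
                            "evidence": [
                                {"id": e["id"], "path": e["path"]}
                                for e in evid.get(f["id"], [])
                            ],
                        }
                        for f in feats.get(c["id"], [])
                    ],
                }
                for c in caps.get(a["id"], [])
            ],
        }
        for a in abilities
    ]


def ability_map_json(*tables) -> str:
    """Serialize the nested map so the UI and API share one structure."""
    return json.dumps(build_ability_map(*tables), indent=2)
```

Rendering the UI from the same nested structure that the API serializes is one way to satisfy the "same map" criterion.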
Milestone 2: Git Ingestion and Deterministic Scanner
Goal: establish trustworthy observed facts from repository contents.
Deliverables:
- Git URL validation
- Clone/fetch and checkout
- Snapshot record with branch and commit hash
- File tree scan
- README/docs/examples/tests/package manifest detection
- Basic language/framework/interface detection
- Analysis run status tracking
Acceptance criteria:
- A public Git repository can be registered and analyzed
- The system records a snapshot and deterministic scan summary
- Analysis failures are visible without corrupting prior data
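Git URL validation can be done before any `git` subprocess is started. The following is a minimal sketch using only the standard library. The allowed schemes and the scp-style pattern are assumptions for illustration; the workplan does not specify which URL forms v0.1 must accept.

```python
import re
from urllib.parse import urlparse

# Assumption: local "file" URLs and plain paths are rejected for the MVP.
ALLOWED_SCHEMES = {"https", "ssh", "git"}

# scp-style remotes such as user@host:org/repo.git
_SCP_STYLE = re.compile(r"^[\w.-]+@[\w.-]+:[\w./~-]+$")


def is_valid_git_url(url: str) -> bool:
    """Return True if the string looks like a remote Git URL."""
    url = url.strip()
    if not url or url.startswith("-"):
        # A leading dash could be interpreted as an option by git.
        return False
    if _SCP_STYLE.match(url):
        return True
    parsed = urlparse(url)
    return (
        parsed.scheme in ALLOWED_SCHEMES
        and bool(parsed.hostname)
        and len(parsed.path.strip("/")) > 0
    )
```

A syntactic check like this does not prove the repository exists or is public; the clone/fetch step still has to report failures through the analysis run status.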
Milestone 3: Reviewable Candidate Graph
Goal: generate candidate registry entries from deterministic facts and extracted content.
Deliverables:
- Content extraction from README, docs, examples, tests, package metadata, and selected source files
- Source references with file paths and line ranges where possible
- Candidate ability generation
- Candidate capability generation
- Candidate feature generation
- Candidate evidence detection
- Confidence scoring using the documented additive factors
- Candidate graph endpoint and UI
Acceptance criteria:
- Analysis produces candidates with source references and confidence
- Candidates distinguish observed facts from interpreted claims
- Candidate output is explainable enough for curator review
Milestone 4: Review and Approval Workflow
Goal: turn candidates into canonical registry entries.
Deliverables:
- Approve/reject candidate entries
- Edit names, descriptions, confidence, and relationships
- Merge duplicate abilities/capabilities/features
- Relink capabilities, features, and evidence
- Publish approved repository profile
- Persist review decisions
Acceptance criteria:
- A curator can correct and approve an analysis result
- Only approved entries appear in canonical search/profile views
- Repository status changes from analyzed to indexed/published
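The rule that only approved entries reach canonical views can be illustrated with a small in-memory model. This is a sketch under stated assumptions: `ReviewQueue`, its method names, and the dict-based records are hypothetical stand-ins for the database-backed workflow, and only the approve and reject actions are shown.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ReviewQueue:
    candidates: dict[str, dict]
    approved: dict[str, dict] = field(default_factory=dict)
    decisions: list[dict] = field(default_factory=list)

    def review(self, candidate_id: str, action: str, reviewer: str,
               edits: Optional[dict] = None) -> None:
        """Record a decision; approval copies the edited candidate."""
        if action not in {"approve", "reject"}:
            raise ValueError(f"unsupported action: {action}")
        candidate = self.candidates[candidate_id]
        # Every decision is persisted, including rejections.
        self.decisions.append(
            {"candidate_id": candidate_id, "action": action,
             "reviewer": reviewer, "edits": edits or {}}
        )
        if action == "approve":
            self.approved[candidate_id] = {**candidate, **(edits or {})}

    def canonical_view(self) -> list[dict]:
        """Only approved entries are visible to search and profiles."""
        return list(self.approved.values())
```

Keeping candidates and approved entries in separate collections mirrors the separate candidate and approved entities in the domain model.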
Milestone 5: Search and Inspection
Goal: make the registry useful for discovery.
Deliverables:
- Text search over repositories, abilities, capabilities, and descriptions
- Semantic search with pgvector
- Search filters for language, framework, and ability/capability presence
- Search UI
- Repository profile drill-down UI
- Code/evidence links from features and capabilities
Acceptance criteria:
- A user can search by need using natural language
- Results show repository, matching ability/capability, confidence, and evidence level
- A user can drill from a search result into the ability map and code/evidence references
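Combining filters, text matching, and vector similarity, while also reporting why a result matched, can be sketched in memory as follows. In the planned stack the vector comparison would presumably run inside PostgreSQL via pgvector; the entry fields, the equal default weighting, and the whitespace tokenization here are all illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(entries, query_terms, query_vec, language=None, text_weight=0.5):
    """Rank approved entries and attach human-readable match reasons."""
    results = []
    for entry in entries:
        if language and entry["language"] != language:
            continue  # relational filter
        words = set(entry["description"].lower().split())
        matched = sorted(t for t in query_terms if t.lower() in words)
        text_score = len(matched) / len(query_terms) if query_terms else 0.0
        vec_score = max(0.0, cosine(query_vec, entry["embedding"]))
        score = text_weight * text_score + (1 - text_weight) * vec_score
        reasons = []
        if matched:
            reasons.append("matched terms: " + ", ".join(matched))
        if vec_score > 0:
            reasons.append(f"semantic similarity {vec_score:.2f}")
        results.append(
            {"capability": entry["capability"],
             "score": round(score, 3),
             "reasons": reasons}
        )
    return sorted(results, key=lambda r: r["score"], reverse=True)
```

Returning the reasons alongside the score gives the search UI something concrete to display next to each result.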
Milestone 6: API Completeness for Agents
Goal: support programmatic consumers cleanly.
Deliverables:
- GET /repos
- POST /repos
- GET /repos/{id}
- POST /repos/{id}/analysis-runs
- GET /repos/{id}/analysis-runs/{run_id}
- GET /repos/{id}/ability-map
- GET /abilities
- GET /capabilities
- GET /search?q=...
- OpenAPI examples
Acceptance criteria:
- API covers repository registration, analysis, search, and inspection
- Responses are stable enough for agent/tooling integration
- OpenAPI docs describe all MVP endpoints
7. Initial Database Shape
Start with tables for:
- repositories
- repository_snapshots
- analysis_runs
- observed_facts
- source_references
- candidate_abilities
- candidate_capabilities
- candidate_features
- candidate_evidence
- candidate_links
- approved_abilities
- approved_capabilities
- approved_features
- approved_evidence
- approved_links
- review_decisions
- content_chunks
- embeddings
Use status fields consistently:
registered
ingesting
analyzing
analysis_failed
analyzed
reviewing
indexed
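Consistent status handling is easier when the values live in one enum with an explicit transition table. The sketch below uses the status values listed above; the allowed transitions are an assumption made for illustration, since the workplan names the states but not the edges between them.

```python
from enum import Enum


class RepoStatus(str, Enum):
    REGISTERED = "registered"
    INGESTING = "ingesting"
    ANALYZING = "analyzing"
    ANALYSIS_FAILED = "analysis_failed"
    ANALYZED = "analyzed"
    REVIEWING = "reviewing"
    INDEXED = "indexed"


S = RepoStatus

# Hypothetical transition table: failures can be retried, and any
# settled state can start a fresh ingestion for a newer snapshot.
ALLOWED_TRANSITIONS = {
    S.REGISTERED: {S.INGESTING},
    S.INGESTING: {S.ANALYZING, S.ANALYSIS_FAILED},
    S.ANALYZING: {S.ANALYZED, S.ANALYSIS_FAILED},
    S.ANALYSIS_FAILED: {S.INGESTING},
    S.ANALYZED: {S.REVIEWING, S.INGESTING},
    S.REVIEWING: {S.INDEXED, S.INGESTING},
    S.INDEXED: {S.INGESTING},
}


def transition(current: RepoStatus, target: RepoStatus) -> RepoStatus:
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

Subclassing `str` keeps the enum values directly storable in a text column and serializable as JSON.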
8. Analyzer v0.1 Strategy
The first analyzer should be intentionally modest.
Deterministic scan:
- Identify repo root metadata files
- Identify docs, examples, tests, package manifests, API specs, config files
- Detect languages from extensions and package files
- Detect common frameworks from manifests
- Detect likely API/CLI features using simple framework-specific scanners
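A deterministic scan of this kind can start as pure path classification over the file tree. The following sketch assumes a small, hand-picked set of manifest names, directory conventions, and extension mappings; a real scanner would need a longer list and framework-specific detectors.

```python
from collections import Counter
from pathlib import PurePosixPath

# Illustrative subsets only, not a complete mapping.
LANGUAGE_BY_EXT = {".py": "Python", ".js": "JavaScript",
                   ".ts": "TypeScript", ".go": "Go", ".rs": "Rust"}
MANIFESTS = {"pyproject.toml", "setup.py", "requirements.txt",
             "package.json", "Cargo.toml", "go.mod"}


def classify(path: str) -> str:
    """Assign one category to a repository-relative path."""
    p = PurePosixPath(path)
    dirs = {part.lower() for part in p.parts[:-1]}
    if p.name in MANIFESTS:
        return "manifest"
    if p.name.lower().startswith("readme"):
        return "readme"
    if "tests" in dirs or "test" in dirs or p.name.startswith("test_"):
        return "test"
    if "examples" in dirs:
        return "example"
    if "docs" in dirs or p.suffix.lower() in {".md", ".rst"}:
        return "doc"
    return "source" if p.suffix.lower() in LANGUAGE_BY_EXT else "other"


def scan(paths):
    """Summarize categories and language counts for a list of paths."""
    categories = {}
    languages = Counter()
    for path in sorted(paths):  # sorted so repeated scans give equal output
        categories.setdefault(classify(path), []).append(path)
        lang = LANGUAGE_BY_EXT.get(PurePosixPath(path).suffix.lower())
        if lang:
            languages[lang] += 1
    return {"categories": categories, "languages": dict(languages)}
```

Because the function depends only on the list of paths, the same snapshot always produces the same summary, which is what makes the result usable as an observed fact.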
Content extraction:
- README and docs first
- Examples and tests second
- Selected source files only when they expose interfaces
- Preserve path and line references
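Preserving path and line references is simplest when chunking is line-based from the start. This is a minimal sketch; the fixed `max_lines` window and the dict layout are assumptions, and a later version might split on headings or function boundaries instead.

```python
def chunk_lines(path: str, text: str, max_lines: int = 40):
    """Split text into chunks that remember their 1-based line range."""
    lines = text.splitlines()
    chunks = []
    for start in range(0, len(lines), max_lines):
        block = lines[start:start + max_lines]
        chunks.append(
            {
                "path": path,
                "start_line": start + 1,
                "end_line": start + len(block),
                "text": "\n".join(block),
            }
        )
    return chunks
```

Each chunk can then be stored with its source reference, so a candidate derived from the chunk can point back to the exact lines.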
LLM extraction:
- Use separate prompts for abilities, capabilities, features, and evidence
- Request structured JSON
- Require source references for each candidate
- Reject or mark speculative any candidate without supporting sources
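The last two rules can be enforced after the model replies, independent of which provider produced the output. The sketch below assumes the extractor returns a JSON array of objects with `name` and `sources` fields, where each source has a `path`; that response shape is hypothetical. A source only counts if its path exists in the scanned snapshot.

```python
import json


def parse_candidates(raw: str, known_paths: set[str]):
    """Split an extractor reply into (accepted, speculative) candidates."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return [], []  # unparseable output yields no candidates
    if not isinstance(payload, list):
        return [], []
    accepted, speculative = [], []
    for item in payload:
        if not isinstance(item, dict) or not item.get("name"):
            continue  # drop malformed or unnamed entries
        raw_sources = item.get("sources")
        if not isinstance(raw_sources, list):
            raw_sources = []
        # Keep only sources that point at files the scanner actually saw.
        sources = [
            s for s in raw_sources
            if isinstance(s, dict) and s.get("path") in known_paths
        ]
        candidate = {
            "name": item["name"],
            "sources": sources,
            "speculative": not sources,
        }
        (accepted if sources else speculative).append(candidate)
    return accepted, speculative
```

Checking paths against the deterministic scan means an extractor cannot support a claim by citing a file that is not in the snapshot.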
Confidence scoring:
- Start from the documented additive model
- Normalize to 0.0-1.0
- Store both numeric confidence and label
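An additive score with normalization and a label could look like the sketch below. The factor names, point values, and label thresholds are placeholders chosen for illustration; the documented additive factors should replace them.

```python
# Hypothetical factors and points; replace with the documented model.
FACTOR_POINTS = {
    "documented": 30,
    "has_tests": 30,
    "has_example": 20,
    "public_interface": 20,
}


def score_confidence(factors: set[str]) -> tuple[float, str]:
    """Sum the points for present factors and normalize to 0.0-1.0."""
    raw = sum(FACTOR_POINTS.get(f, 0) for f in factors)
    total = sum(FACTOR_POINTS.values())
    value = round(min(max(raw / total, 0.0), 1.0), 2)
    # Assumed thresholds for the stored label.
    if value >= 0.7:
        label = "high"
    elif value >= 0.4:
        label = "medium"
    else:
        label = "low"
    return value, label
```

Storing the numeric value and the label together lets the UI show the label while search ranking uses the number.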
9. UI Workplan
Build application screens in this order:
- Repository list
- Repository registration
- Repository detail and analysis status
- Deterministic scan summary
- Candidate review tree
- Published repository profile
- Search
The UI should feel like an operational tool rather than a marketing site: dense, clear, review-focused, and optimized for repeated curator work.
10. Testing Strategy
Add tests around the highest-risk boundaries:
- Database migrations and model relationships
- Git URL validation
- Scanner output for fixture repositories
- Candidate graph validation
- Review workflow transitions
- Search result ranking and filtering
- API contract tests for MVP endpoints
Create small fixture repositories for:
- README-only repository
- Python CLI repository
- FastAPI repository
- JavaScript/TypeScript package
- Repository with tests and examples
- Repository with weak or misleading docs
11. Key Risks and Mitigations
Extraction quality risk:
- Require source references.
- Keep candidates reviewable.
- Separate observed facts from interpreted claims.
Over-complex ontology risk:
- Keep v0.1 schema minimal.
- Avoid enforcing deep taxonomy too early.
Search quality risk:
- Combine relational filters, full-text search, and vector search.
- Show why a result matched.
Operational complexity risk:
- Start as a modular monolith.
- Use simple jobs before adding worker infrastructure.
Trust risk:
- Never publish unapproved claims as canonical truth.
- Preserve analysis run history and review decisions.
12. Immediate Next Actions
Recommended next implementation sequence:
- Scaffold the FastAPI application, database migrations, and test harness.
- Implement the core schema for repositories, snapshots, analysis runs, observed facts, candidates, and approved entries.
- Add manual registry CRUD and ability-map API.
- Build a minimal repository list/profile UI.
- Add Git ingestion and deterministic scanning.
- Add candidate graph generation and review workflow.
The first meaningful demo should be:
Create a repository
Add or generate an ability map
Approve it
Search for a capability
Open the repository profile
Drill down to feature and evidence locations