Added codex generated workplan

2026-04-25 21:33:20 +02:00
parent 8d3d5aab42
commit a833d4f82c
1 changed files with 415 additions and 0 deletions
--- a/workplans/ImplementationWorkplan.md
+++ b/workplans/ImplementationWorkplan.md
@@ -0,0 +1,415 @@
+# Repository Ability Registry Implementation Workplan
+
+## 1. Documentation Review Summary
+
+The wiki defines a coherent v0.1 product: a registry that turns Git repositories into reviewable, source-linked maps of:
+
+```text
+Ability -> Capability -> Feature -> Evidence -> Code location
+```
+
+The strongest architectural principle across the docs is:
+
+```text
+deterministic scanners establish observed facts
+LLM-assisted extractors propose interpreted claims
+humans or trusted agents approve registry truth
+```
+
+This should remain the core design constraint for implementation. The system should be conservative, explainable, reviewable, and source-linked rather than attempting fully automatic code understanding.
+
+## 2. MVP Scope
+
+The first version should implement the core journey documented in the PRD, FRS, architecture sketch, and use-case catalog:
+
+```text
+Register repository
+Analyze repository
+Generate candidate ability/capability/feature/evidence map
+Review and approve candidates
+Publish registry profile
+Search and inspect repositories
+```
+
+In scope for v0.1:
+
+- Repository registration by Git URL
+- Repository metadata and snapshot tracking
+- Deterministic repository scan
+- Candidate extraction for abilities, capabilities, features, and evidence
+- Human review actions: edit, approve, reject, merge, relink
+- Inspectable ability map
+- Natural-language search over approved registry entries
+- API access for repositories, ability maps, capabilities, and search
+
+Out of scope for v0.1:
+
+- Continuous GitHub app integration
+- Full static code understanding
+- Advanced ontology enforcement
+- Distributed indexing
+- Benchmark execution
+- Marketplace features
+- Complex access control
+- Automated truth claims without review
+
+## 3. Recommended Technical Baseline
+
+Use a pragmatic stack that keeps the analyzer and registry easy to evolve:
+
+- Backend: Python FastAPI
+- Database: PostgreSQL
+- Semantic search: pgvector inside PostgreSQL
+- Worker: simple background jobs first; graduate to RQ or Celery when needed
+- Git access: subprocess git or GitPython
+- Frontend: React/Next.js, or server-rendered FastAPI templates for the earliest prototype
+- LLM extraction: provider-abstracted interface
+- Local artifact storage: filesystem under an application data directory
+
+For the first implementation pass, prefer a modular monolith over distributed services. Keep clean module boundaries internally, but avoid operational complexity until the product loop is proven.
+
+## 4. Core Domain Model
+
+Implement these entities first:
+
+- Repository
+- RepositorySnapshot
+- AnalysisRun
+- ObservedFact
+- CandidateAbility
+- CandidateCapability
+- CandidateFeature
+- CandidateEvidence
+- ApprovedAbility
+- ApprovedCapability
+- ApprovedFeature
+- ApprovedEvidence
+- SourceReference
+- ReviewDecision
+
+The model should preserve a clear distinction between observed facts and interpreted claims.
+
+Observed facts include things like:
+
+- File paths
+- Documentation files
+- Test files
+- Package manifests
+- API routes
+- CLI commands
+- Public modules/functions
+- Detected languages/frameworks
+
+Interpreted claims include:
+
+- Ability names and descriptions
+- Capability names and descriptions
+- Feature-to-capability links
+- Evidence-to-capability links
+- Confidence scores
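
The split between observed facts and interpreted claims can be carried by the types themselves. The following is a minimal sketch, assuming plain Python dataclasses; the field names (`kind`, `value`, `sources`, `confidence`) and the `is_supported` helper are illustrative assumptions, not part of the documented schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class SourceReference:
    """Points at a file, and optionally a line range, in a snapshot."""
    path: str
    start_line: Optional[int] = None
    end_line: Optional[int] = None


@dataclass(frozen=True)
class ObservedFact:
    """Something a deterministic scanner saw; carries no interpretation."""
    kind: str   # hypothetical values: "doc_file", "test_file", "api_route"
    value: str
    source: SourceReference


@dataclass
class CandidateCapability:
    """An interpreted claim that still needs review before it is approved."""
    name: str
    description: str
    confidence: float
    sources: List[SourceReference] = field(default_factory=list)

    def is_supported(self) -> bool:
        # A claim without any source reference cannot be traced back to code.
        return len(self.sources) > 0
```

Keeping `ObservedFact` immutable and free of confidence values is one way to stop interpretation from leaking into the scanner output.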
+
+## 5. Suggested Module Boundaries
+
+Use the architecture sketch's boundaries as implementation modules:
+
+- `repo_ingestion`: validate Git URLs, clone/fetch repos, resolve branch/commit
+- `repo_scanning`: deterministic file tree, language, docs, tests, examples, API/CLI detection
+- `content_indexing`: text extraction, chunking, source references, embeddings
+- `llm_extraction`: prompt orchestration and structured candidate generation
+- `candidate_graph`: build and validate ability/capability/feature/evidence relationships
+- `review_workflow`: edit, approve, reject, merge, relink, publish
+- `registry_query`: search, filters, profile retrieval, ability-map assembly
+- `web_api`: HTTP endpoints and request/response schemas
+- `web_ui`: registration, analysis, review, profile, and search screens
+
+## 6. Milestones
+
+### Milestone 0: Project Foundation
+
+Goal: establish the application skeleton and development path.
+
+Deliverables:
+
+- Backend app skeleton
+- Database migration setup
+- Configuration system
+- Local development instructions
+- Basic test harness
+- Health endpoint
+
+Acceptance criteria:
+
+- App starts locally
+- Tests run locally
+- Database migrations apply cleanly
+
+### Milestone 1: Manual Registry
+
+Goal: prove the core data model and inspection experience before automation.
+
+Deliverables:
+
+- Repository CRUD
+- Manual ability/capability/feature/evidence CRUD
+- Ability map endpoint
+- Basic repository profile UI
+
+Acceptance criteria:
+
+- A user can create a repository profile by hand
+- The UI displays `Ability -> Capability -> Feature -> Evidence`
+- API returns the same map as structured JSON
+
+### Milestone 2: Git Ingestion and Deterministic Scanner
+
+Goal: establish trustworthy observed facts from repository contents.
+
+Deliverables:
+
+- Git URL validation
+- Clone/fetch and checkout
+- Snapshot record with branch and commit hash
+- File tree scan
+- README/docs/examples/tests/package manifest detection
+- Basic language/framework/interface detection
+- Analysis run status tracking
+
+Acceptance criteria:
+
+- A public Git repository can be registered and analyzed
+- The system records a snapshot and deterministic scan summary
+- Analysis failures are visible without corrupting prior data
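
Git URL validation is the first guard in this milestone. Below is a hedged sketch using only the standard library; the accepted schemes and the support for scp-like syntax (`git@host:owner/repo.git`) are assumptions, since the workplan does not specify which URL forms are allowed.

```python
import re
from urllib.parse import urlparse

# Assumption: only these transports are accepted for v0.1.
ALLOWED_SCHEMES = {"https", "ssh", "git"}

# scp-like syntax, e.g. git@example.com:owner/repo.git
SCP_LIKE = re.compile(r"^[\w.-]+@[\w.-]+:[\w./-]+$")


def is_valid_git_url(url: str) -> bool:
    """Return True if the string looks like a remote Git URL we accept."""
    url = url.strip()
    if not url or url.startswith("-"):
        # A leading dash could be read by git as a command-line option.
        return False
    if SCP_LIKE.match(url):
        return True
    parsed = urlparse(url)
    if parsed.scheme not in ALLOWED_SCHEMES:
        return False
    # Require a host and a non-empty path such as /owner/repo.git.
    return bool(parsed.netloc) and len(parsed.path.strip("/")) > 0
```

Rejecting `file://` and option-like strings before any `git` subprocess runs keeps local paths and injected flags out of the ingestion step.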
+
+### Milestone 3: Reviewable Candidate Graph
+
+Goal: generate candidate registry entries from deterministic facts and extracted content.
+
+Deliverables:
+
+- Content extraction from README, docs, examples, tests, package metadata, and selected source files
+- Source references with file paths and line ranges where possible
+- Candidate ability generation
+- Candidate capability generation
+- Candidate feature generation
+- Candidate evidence detection
+- Confidence scoring using the documented additive factors
+- Candidate graph endpoint and UI
+
+Acceptance criteria:
+
+- Analysis produces candidates with source references and confidence
+- Candidates distinguish observed facts from interpreted claims
+- Candidate output is explainable enough for curator review
+
+### Milestone 4: Review and Approval Workflow
+
+Goal: turn candidates into canonical registry entries.
+
+Deliverables:
+
+- Approve/reject candidate entries
+- Edit names, descriptions, confidence, and relationships
+- Merge duplicate abilities/capabilities/features
+- Relink capabilities, features, and evidence
+- Publish approved repository profile
+- Persist review decisions
+
+Acceptance criteria:
+
+- A curator can correct and approve an analysis result
+- Only approved entries appear in canonical search/profile views
+- Repository status changes from analyzed to indexed/published
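
The rule that only approved entries reach canonical views can be sketched with an in-memory stand-in for the database. The `Registry` class, its dictionaries, and the `review` signature are hypothetical; merge, edit, and relink actions are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Registry:
    candidates: dict = field(default_factory=dict)   # id -> candidate payload
    approved: dict = field(default_factory=dict)     # id -> approved payload
    decisions: List[dict] = field(default_factory=list)

    def review(self, candidate_id: str, action: str, reviewer: str) -> None:
        """Record the decision; promote the candidate only on approval."""
        if action not in {"approve", "reject"}:
            raise ValueError(f"unsupported action: {action}")
        candidate = self.candidates.pop(candidate_id)
        self.decisions.append(
            {"candidate_id": candidate_id, "action": action, "reviewer": reviewer}
        )
        if action == "approve":
            self.approved[candidate_id] = candidate

    def canonical_view(self) -> list:
        """Only approved entries are visible to search and profile views."""
        return list(self.approved.values())
```

Every decision is appended to `decisions`, including rejections, so the review history survives even when nothing is published.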
+
+### Milestone 5: Search and Inspection
+
+Goal: make the registry useful for discovery.
+
+Deliverables:
+
+- Text search over repositories, abilities, capabilities, and descriptions
+- Semantic search with pgvector
+- Search filters for language, framework, and ability/capability presence
+- Search UI
+- Repository profile drill-down UI
+- Code/evidence links from features and capabilities
+
+Acceptance criteria:
+
+- A user can search by need using natural language
+- Results show repository, matching ability/capability, confidence, and evidence level
+- A user can drill from a search result into the ability map and code/evidence references
+
+### Milestone 6: API Completeness for Agents
+
+Goal: support programmatic consumers cleanly.
+
+Deliverables:
+
+- `GET /repos`
+- `POST /repos`
+- `GET /repos/{id}`
+- `POST /repos/{id}/analysis-runs`
+- `GET /repos/{id}/analysis-runs/{run_id}`
+- `GET /repos/{id}/ability-map`
+- `GET /abilities`
+- `GET /capabilities`
+- `GET /search?q=...`
+- OpenAPI examples
+
+Acceptance criteria:
+
+- API covers repository registration, analysis, search, and inspection
+- Responses are stable enough for agent/tooling integration
+- OpenAPI docs describe all MVP endpoints
+
+## 7. Initial Database Shape
+
+Start with tables for:
+
+- `repositories`
+- `repository_snapshots`
+- `analysis_runs`
+- `observed_facts`
+- `source_references`
+- `candidate_abilities`
+- `candidate_capabilities`
+- `candidate_features`
+- `candidate_evidence`
+- `candidate_links`
+- `approved_abilities`
+- `approved_capabilities`
+- `approved_features`
+- `approved_evidence`
+- `approved_links`
+- `review_decisions`
+- `content_chunks`
+- `embeddings`
+
+Use status fields consistently:
+
+```text
+registered
+ingesting
+analyzing
+analysis_failed
+analyzed
+reviewing
+indexed
+```
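
One way to keep these statuses consistent is to route every change through a single transition table. The sketch below is an assumption: the workplan lists the statuses but not the allowed edges, so the table is inferred from the status names and the milestone descriptions.

```python
# Assumed transition table; the edges are not specified in the workplan.
ALLOWED_TRANSITIONS = {
    "registered": {"ingesting"},
    "ingesting": {"analyzing", "analysis_failed"},
    "analyzing": {"analyzed", "analysis_failed"},
    "analysis_failed": {"ingesting"},   # retry from the start
    "analyzed": {"reviewing"},
    "reviewing": {"indexed"},
    "indexed": {"ingesting"},           # re-analysis of a newer snapshot
}


class InvalidTransition(ValueError):
    """Raised when a status change is not in the transition table."""


def transition(current: str, new: str) -> str:
    """Return the new status, or raise if the move is not allowed."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise InvalidTransition(f"{current} -> {new}")
    return new
```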
+
+## 8. Analyzer v0.1 Strategy
+
+The first analyzer should be intentionally modest.
+
+Deterministic scan:
+
+- Identify repo root metadata files
+- Identify docs, examples, tests, package manifests, API specs, config files
+- Detect languages from extensions and package files
+- Detect common frameworks from manifests
+- Detect likely API/CLI features using simple framework-specific scanners
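
The deterministic scan steps above can be sketched as a pure function over repository-relative paths. The lookup tables and category names here are deliberately small, assumed examples; a real scanner would cover more languages, manifests, and framework-specific patterns.

```python
from collections import Counter
from pathlib import PurePosixPath
from typing import Dict, List

# Assumed lookup tables, kept small for illustration.
LANGUAGE_BY_EXTENSION = {
    ".py": "Python", ".js": "JavaScript", ".ts": "TypeScript", ".go": "Go",
}
MANIFESTS = {"pyproject.toml", "requirements.txt", "package.json", "go.mod"}


def classify(path: str) -> str:
    """Assign one coarse category to a repository-relative path."""
    p = PurePosixPath(path)
    parents = {part.lower() for part in p.parts[:-1]}
    name = p.name.lower()
    if name in MANIFESTS:
        return "manifest"
    if name.startswith("readme") or "docs" in parents or p.suffix in {".md", ".rst"}:
        return "docs"
    if "tests" in parents or "test" in parents or name.startswith("test_"):
        return "test"
    if "examples" in parents:
        return "example"
    return "source" if p.suffix in LANGUAGE_BY_EXTENSION else "other"


def scan(paths: List[str]) -> Dict[str, Dict[str, int]]:
    """Summarize file categories and languages for a list of paths."""
    categories = Counter(classify(p) for p in paths)
    languages = Counter(
        LANGUAGE_BY_EXTENSION[PurePosixPath(p).suffix]
        for p in paths
        if PurePosixPath(p).suffix in LANGUAGE_BY_EXTENSION
    )
    return {"categories": dict(categories), "languages": dict(languages)}
```

Because the function depends only on its input list, the same snapshot always yields the same summary, which is what makes the scan output usable as observed fact.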
+
+Content extraction:
+
+- README and docs first
+- Examples and tests second
+- Selected source files only when they expose interfaces
+- Preserve path and line references
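
Preserving path and line references is easiest when chunks are cut on line boundaries. This is a minimal sketch; the `chunk_lines` name, the dictionary keys, and the default chunk size of 40 lines are assumptions.

```python
from typing import Dict, List


def chunk_lines(path: str, text: str, max_lines: int = 40) -> List[Dict]:
    """Split text into chunks that remember their 1-based line range."""
    lines = text.splitlines()
    chunks = []
    for start in range(0, len(lines), max_lines):
        block = lines[start:start + max_lines]
        chunks.append({
            "path": path,
            "start_line": start + 1,          # first line in the chunk
            "end_line": start + len(block),   # last line in the chunk
            "text": "\n".join(block),
        })
    return chunks
```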
+
+LLM extraction:
+
+- Use separate prompts for abilities, capabilities, features, and evidence
+- Request structured JSON
+- Require source references for each candidate
+- Reject or mark speculative any candidate without supporting sources
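
The last two rules can be enforced after the model responds, independent of which provider produced the JSON. The sketch below assumes a hypothetical payload shape (a JSON array of objects with `name` and a `sources` list of `{"path": ...}` entries) and a hypothetical `triage_candidates` helper.

```python
import json


def triage_candidates(raw_json: str, known_paths: set) -> dict:
    """Split parsed candidates into accepted and speculative lists.

    The payload shape is an assumption: a JSON array of objects, each with
    a "name" and a "sources" list of {"path": ...} entries.
    """
    accepted, speculative = [], []
    for candidate in json.loads(raw_json):
        # Keep only references to files the deterministic scan actually saw.
        sources = [
            s for s in candidate.get("sources", [])
            if s.get("path") in known_paths
        ]
        candidate = {**candidate, "sources": sources}
        (accepted if sources else speculative).append(candidate)
    return {"accepted": accepted, "speculative": speculative}
```

Checking each cited path against the scanned file list also catches references to files that do not exist in the snapshot.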
+
+Confidence scoring:
+
+- Start from the documented additive model
+- Normalize to `0.0-1.0`
+- Store both numeric confidence and label
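
An additive score with a derived label might look like the sketch below. The factor names, weights, and label thresholds are placeholders: the documented additive model is not reproduced in this workplan, so none of these values should be read as the real ones.

```python
# Hypothetical factors and weights; placeholders for the documented model.
FACTOR_WEIGHTS = {
    "documented": 0.3,
    "has_tests": 0.3,
    "has_examples": 0.2,
    "public_interface": 0.2,
}

# Assumed label thresholds, checked from highest to lowest.
LABELS = [(0.75, "high"), (0.4, "medium"), (0.0, "low")]


def score(factors: set) -> tuple:
    """Add the weights of the present factors and clamp to 0.0-1.0."""
    total = sum(FACTOR_WEIGHTS.get(f, 0.0) for f in factors)
    confidence = round(max(0.0, min(1.0, total)), 2)
    label = next(name for threshold, name in LABELS if confidence >= threshold)
    return confidence, label
```

With weights that sum to 1.0, clamping is enough to keep the result in range; a model whose weights sum to more would need an explicit division by the maximum total.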
+
+## 9. UI Workplan
+
+Build application screens in this order:
+
+1. Repository list
+2. Repository registration
+3. Repository detail and analysis status
+4. Deterministic scan summary
+5. Candidate review tree
+6. Published repository profile
+7. Search
+
+The UI should feel like an operational tool rather than a marketing site: dense, clear, review-focused, and optimized for repeated curator work.
+
+## 10. Testing Strategy
+
+Add tests around the highest-risk boundaries:
+
+- Database migrations and model relationships
+- Git URL validation
+- Scanner output for fixture repositories
+- Candidate graph validation
+- Review workflow transitions
+- Search result ranking and filtering
+- API contract tests for MVP endpoints
+
+Create small fixture repositories for:
+
+- README-only repository
+- Python CLI repository
+- FastAPI repository
+- JavaScript/TypeScript package
+- Repository with tests and examples
+- Repository with weak or misleading docs
+
+## 11. Key Risks and Mitigations
+
+Extraction quality risk:
+
+- Require source references.
+- Keep candidates reviewable.
+- Separate observed facts from interpreted claims.
+
+Over-complex ontology risk:
+
+- Keep v0.1 schema minimal.
+- Avoid enforcing deep taxonomy too early.
+
+Search quality risk:
+
+- Combine relational filters, full-text search, and vector search.
+- Show why a result matched.
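
As one possible way to combine the three search signals, the sketch below merges a full-text ranking and a vector ranking with reciprocal rank fusion, after applying a relational filter. Reciprocal rank fusion is a technique chosen here for illustration, not one named in the docs; the `fuse` function, its arguments, and `k=60` are assumptions.

```python
from collections import defaultdict


def fuse(text_ranked, vector_ranked, allowed_ids, k=60):
    """Merge two ranked ID lists with reciprocal rank fusion.

    allowed_ids models the relational filter step (language, framework).
    Each result records which searches matched it and at what rank.
    """
    scores = defaultdict(float)
    reasons = defaultdict(list)
    for label, ranked in (("full_text", text_ranked), ("vector", vector_ranked)):
        for rank, item_id in enumerate(ranked, start=1):
            if item_id not in allowed_ids:
                continue  # excluded by the relational filter
            scores[item_id] += 1.0 / (k + rank)
            reasons[item_id].append(f"{label} rank {rank}")
    ordered = sorted(scores, key=lambda i: scores[i], reverse=True)
    return [(i, scores[i], reasons[i]) for i in ordered]
```

The `reasons` list gives the UI something concrete to show for "why a result matched" without exposing raw similarity scores.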
+
+Operational complexity risk:
+
+- Start as a modular monolith.
+- Use simple jobs before adding worker infrastructure.
+
+Trust risk:
+
+- Never publish unapproved claims as canonical truth.
+- Preserve analysis run history and review decisions.
+
+## 12. Immediate Next Actions
+
+Recommended next implementation sequence:
+
+1. Scaffold the FastAPI application, database migrations, and test harness.
+2. Implement the core schema for repositories, snapshots, analysis runs, observed facts, candidates, and approved entries.
+3. Add manual registry CRUD and ability-map API.
+4. Build a minimal repository list/profile UI.
+5. Add Git ingestion and deterministic scanning.
+6. Add candidate graph generation and review workflow.
+
+The first meaningful demo should be:
+
+```text
+Create a repository
+Add or generate an ability map
+Approve it
+Search for a capability
+Open the repository profile
+Drill down to feature and evidence locations
+```