AbilityExtractionHeuristics

*How repositories will be explored*

# Ability / Capability Extraction Heuristics v0.1

## Repository Scoping

## 1. Purpose

The extraction engine should answer:

> “What is this repository useful for, what bounded behaviors does it provide, and where are those behaviors implemented?”

It should produce **candidate entries**, not final truth. Human/agent review remains part of the workflow.

---

# 2. Extraction Layers

```text
Ability → usefulness / problem class
Capability → bounded behavior
Feature → concrete interface or implementation
Evidence → reason to believe the claim
```

---

# 3. Source Priority

Not all repository signals are equally trustworthy.

## Priority 1 — High Trust

Use these first:

```text
README
docs/
examples/
tests/
API specs
CLI help
package metadata
```

These usually express intended usage.

## Priority 2 — Medium Trust

```text
module names
function names
class names
route names
config files
workflow files
```

These show implemented structure.

## Priority 3 — Low Trust

```text
comments
commit messages
dependency names
directory names alone
```

Useful as supporting signals, but not enough by themselves.

---

# 4. Ability Extraction Heuristics

Abilities describe **why the repository is useful**.

## 4.1 Ability Signal Patterns

Look for phrases like:

```text
"helps users..."
"enables..."
"automates..."
"provides a way to..."
"used for..."
"designed to..."
"allows..."
"supports..."
```

Example:

```text
"This library helps route incoming business emails."
```

Candidate ability:

```yaml
name: Business Email Routing
```

---

## 4.2 Ability Naming Rule

Ability names should be:

```text
Domain + Problem Class
```

Good:

```text
Business Email Routing
Document Classification
Invoice Data Extraction
Kubernetes Deployment Inspection
Agent Workflow Orchestration
```

Bad:

```text
Fast API
Email Button
Classifier
Uses GPT
```

---

## 4.3 Ability Extraction Sources

Best sources for abilities:

```text
README intro
project tagline
docs overview
examples index
package description
```

Ability is usually described in prose, not code.

---

## 4.4 Ability Confidence

Assign confidence based on signal quality:

```yaml
confidence:
  high:
    - explicitly stated in README/docs
    - supported by examples
    - supported by tests or APIs
  medium:
    - inferred from multiple capabilities/features
    - visible in examples but not stated
  low:
    - inferred from names only
    - based on dependencies or folder structure
```

---

# 5. Capability Extraction Heuristics

Capabilities describe **bounded behavior**.

## 5.1 Capability Signal Patterns

Look for verbs applied to objects:

```text
classify email
extract invoice data
summarize document
validate schema
generate response
deploy service
monitor cluster
route ticket
convert format
```

Pattern:

```text
Verb + Object
```

Examples:

```text
Classify Email Intent
Extract Invoice Metadata
Generate Routing Explanation
Validate Repository Metadata
```

---

## 5.2 Capability Naming Rule

Capability names should be:

```text
Action Verb + Domain Object
```

Good:

```text
Classify Incoming Email
Extract PDF Metadata
Generate API Client
Validate Kubernetes Manifest
Detect Broken Links
```

Bad:

```text
Email Capability
Parser
Smart Document Stuff
Endpoint
```

---

## 5.3 Capability Sources

Best sources:

```text
API route names
CLI commands
public functions
service classes
tests
examples
docs tutorials
```

Capability is often visible in code and tests.

---

## 5.4 Capability Boundary Rule

A capability should be small enough to test.

Good:

```text
Extract invoice date from PDF
Classify email into intent category
Generate markdown from DOCX
```

Too broad:

```text
Manage documents
Automate business
Understand everything
```

Too narrow:

```text
Read config variable
Call helper function
Trim whitespace
```

Rule of thumb:

> If you can write a meaningful acceptance test for it, it is probably a capability.

---

# 6. Feature Extraction Heuristics

Features describe **how the capability is exposed or implemented**.

## 6.1 Feature Signal Patterns

Look for concrete affordances:

```text
REST endpoint
CLI command
UI component
configuration option
SDK method
background job
database migration
import/export format
plugin hook
```

Examples:

```yaml
features:
  - name: /classify-email endpoint
  - name: classify-email CLI command
  - name: department-rules.yaml config
  - name: JSON result export
```

---

## 6.2 Feature Naming Rule

Feature names should be concrete and inspectable.

Good:

```text
POST /api/classify-email
classify-email CLI command
Rule Configuration File
PDF Upload Component
```

Bad:

```text
AI routing
Document understanding
Magic extraction
```

---

# 7. Evidence Extraction Heuristics

Evidence supports claims.

## 7.1 Evidence Types

```yaml
evidence_types:
  - unit_test
  - integration_test
  - example
  - demo
  - benchmark
  - documentation
  - API specification
  - production usage note
  - manual review
```

---

## 7.2 Evidence Mapping

Map evidence to the nearest capability.

Example:

```text
tests/test_email_classifier.py
```

Supports:

```text
Classify Incoming Email
```

Example:

```text
examples/invoice_extraction_demo.py
```

Supports:

```text
Extract Invoice Metadata
```

---

## 7.3 Evidence Strength

```yaml
evidence_strength:
  strong:
    - automated tests
    - benchmark results
    - executable examples
    - integration tests
  medium:
    - documentation
    - tutorials
    - screenshots
    - sample output
  weak:
    - README claim only
    - comments
    - filename hints
```

---

# 8. Ability–Capability–Feature Linking

## 8.1 Link Rule

```text
Ability explains why.
Capability explains what.
Feature explains how/where.
```

Example:

```yaml
ability:
  name: Business Email Routing

capability:
  name: Classify Incoming Email
  supports:
    - Business Email Routing

feature:
  name: POST /api/classify-email
  implements:
    - Classify Incoming Email
```

---

## 8.2 Linking Heuristic

A capability supports an ability if:

```text
Removing the capability would weaken the repository’s ability to deliver that usefulness.
```

A feature implements a capability if:

```text
The feature is an interface, component, or code location through which the behavior is performed or exposed.
```

---

# 9. Confidence Scoring

Use a simple additive model first.

## 9.1 Candidate Confidence Factors

```yaml
confidence_factors:
  explicit_doc_claim: +0.30
  example_present: +0.20
  test_present: +0.25
  implementation_location_found: +0.15
  api_or_cli_exposed: +0.15
  multiple_source_agreement: +0.20
  inferred_from_names_only: -0.25
  no_evidence: -0.30
```

Normalize to:

```text
0.0 – 1.0
```

## 9.2 Confidence Labels

```yaml
0.80 - 1.00: high
0.50 - 0.79: medium
0.20 - 0.49: low
0.00 - 0.19: speculative
```

---

# 10. Classification Rules

## 10.1 Is it an Ability?

Ask:

```text
Would a user search for this as a desired outcome?
```

If yes, probably ability.

Example:

```text
“I need document classification.”
```

Ability.

---

## 10.2 Is it a Capability?

Ask:

```text
Can this behavior be tested with input/output expectations?
```

If yes, probably capability.

Example:

```text
“Classify document into category.”
```

Capability.

---

## 10.3 Is it a Feature?

Ask:

```text
Is this a concrete interface, option, component, or implementation artifact?
```

If yes, probably feature.

Example:

```text
“POST /api/classify-document”
```

Feature.

---

# 11. Anti-Heuristics

Things the extractor should avoid.

## 11.1 Do Not Treat Dependencies as Capabilities

Bad:

```yaml
capability: Uses OpenAI
```

Better:

```yaml
feature: OpenAI provider integration
capability: Generate Text Summary
```

---

## 11.2 Do Not Treat Technology as Ability

Bad:

```yaml
ability: FastAPI
```

Better:

```yaml
feature: FastAPI REST interface
```

---

## 11.3 Do Not Treat Internal Helpers as Capabilities

Bad:

```yaml
capability: Parse YAML Config
```

Unless parsing YAML config is a user-visible behavior.

---

## 11.4 Avoid Vendor-Hype Terms

Bad:

```text
intelligent automation
next-gen AI
enterprise-ready transformation
```

Convert into testable candidates:

```text
Classify Documents
Generate Reports
Route Tasks
```

---

# 12. Extraction Pipeline v0.1

## Step 1 — Repository Intake

Collect:

```text
README
docs
examples
tests
package files
source tree
API routes
CLI definitions
```

---

## Step 2 — Structural Summary

Produce:

```yaml
repository_summary:
  languages: []
  frameworks: []
  interfaces: []
  docs_found: []
  tests_found: []
  examples_found: []
```

---

## Step 3 — Candidate Ability Extraction

From README/docs/package descriptions.

Output:

```yaml
candidate_abilities:
  - name
  - description
  - confidence
  - supporting_sources
```

---

## Step 4 — Candidate Capability Extraction

From APIs, tests, examples, public modules.

Output:

```yaml
candidate_capabilities:
  - name
  - description
  - inputs
  - outputs
  - linked_abilities
  - confidence
  - supporting_sources
```

---

## Step 5 — Candidate Feature Extraction

From endpoints, CLI commands, config files, UI components, modules.

Output:

```yaml
candidate_features:
  - name
  - type
  - location
  - linked_capabilities
  - confidence
```

---

## Step 6 — Evidence Linking

Attach evidence:

```yaml
evidence:
  - type
  - path
  - supports
  - strength
```

---

## Step 7 — Review Package

Generate a curator-friendly review view:

```text
Ability
Capability
Feature
Evidence
```

---

# 13. Example Extraction

Given README:

```text
MailRouter helps companies automatically classify incoming emails and route them to the right department.
```

Given route:

```text
POST /api/classify-email
```

Given test:

```text
tests/test_email_classification.py
```

Output:

```yaml
abilities:
  - id: ability.business_email_routing
    name: Business Email Routing
    confidence: 0.9

capabilities:
  - id: capability.classify_incoming_email
    name: Classify Incoming Email
    ability_refs:
      - ability.business_email_routing
    confidence: 0.85

features:
  - id: feature.classify_email_endpoint
    name: POST /api/classify-email
    type: REST endpoint
    location: src/routes/classify_email.py
    capability_refs:
      - capability.classify_incoming_email

evidence:
  - type: unit_test
    path: tests/test_email_classification.py
    supports:
      - capability.classify_incoming_email
    strength: strong
```

---

# 14. MVP Principle

The extractor should be:

```text
conservative
explainable
reviewable
source-linked
```

Not magical.

The best first version is not the one that extracts everything. It is the one where the user says:

> “Yes, I understand why the system proposed this.”
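
As one way to keep proposals explainable, the additive confidence model from section 9 could be sketched in Python roughly as follows. The weights and label bands are taken from sections 9.1 and 9.2. The function names are hypothetical, and clamping the sum into the 0.0 to 1.0 range is only an assumption about what "normalize" means, since the heuristics do not specify a method.

```python
# Hypothetical sketch of the additive confidence model from section 9.
# Weights (9.1) and label bands (9.2) come from the document; the function
# names and the clamping-based normalization are assumptions.

CONFIDENCE_FACTORS = {
    "explicit_doc_claim": 0.30,
    "example_present": 0.20,
    "test_present": 0.25,
    "implementation_location_found": 0.15,
    "api_or_cli_exposed": 0.15,
    "multiple_source_agreement": 0.20,
    "inferred_from_names_only": -0.25,
    "no_evidence": -0.30,
}

# Lower bound of each label band, highest band first.
LABEL_BANDS = [
    (0.80, "high"),
    (0.50, "medium"),
    (0.20, "low"),
    (0.00, "speculative"),
]


def score_candidate(signals):
    """Sum the weights of the observed signals and clamp to [0.0, 1.0]."""
    observed = set(signals)  # each factor counts at most once
    unknown = observed - set(CONFIDENCE_FACTORS)
    if unknown:
        raise ValueError(f"unknown confidence factors: {sorted(unknown)}")
    raw = sum(CONFIDENCE_FACTORS[name] for name in observed)
    # Assumed normalization: clamp rather than rescale.
    return round(min(1.0, max(0.0, raw)), 2)


def confidence_label(score):
    """Map a normalized score to the label bands of section 9.2."""
    for lower_bound, label in LABEL_BANDS:
        if score >= lower_bound:
            return label
    return "speculative"


if __name__ == "__main__":
    signals = ["explicit_doc_claim", "test_present", "api_or_cli_exposed"]
    score = score_candidate(signals)
    print(score, confidence_label(score))
```

Because the score is just a sum of named factors, a review view can list exactly which signals contributed to a candidate, which fits the conservative, source-linked goal stated above. Rescaling instead of clamping would be an equally plausible reading of the normalization step.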