How repositories will be explored
Ability / Capability Extraction Heuristics v0.1
Repository Scoping
1. Purpose
The extraction engine should answer:
“What is this repository useful for, what bounded behaviors does it provide, and where are those behaviors implemented?”
It should produce candidate entries, not final truth. Human/agent review remains part of the workflow.
2. Extraction Layers
Ability → usefulness / problem class
Capability → bounded behavior
Feature → concrete interface or implementation
Evidence → reason to believe the claim
3. Source Priority
Not all repository signals are equally trustworthy.
Priority 1 — High Trust
Use these first:
README
docs/
examples/
tests/
API specs
CLI help
package metadata
These usually express intended usage.
Priority 2 — Medium Trust
module names
function names
class names
route names
config files
workflow files
These show implemented structure.
Priority 3 — Low Trust
comments
commit messages
dependency names
directory names alone
Useful as supporting signals, but not enough by themselves.
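The three tiers above can be sketched as a lookup table plus a sort. This is a minimal illustration, not a fixed design: the snake_case signal kinds and the `rank_signals` helper are hypothetical names introduced here, and unknown kinds are assumed to rank below every listed tier.

```python
from typing import List, Tuple

# Signal kinds grouped by the three trust tiers listed above.
# The snake_case kind names are illustrative assumptions.
SOURCE_PRIORITY = {
    "readme": 1, "docs": 1, "examples": 1, "tests": 1,
    "api_spec": 1, "cli_help": 1, "package_metadata": 1,
    "module_name": 2, "function_name": 2, "class_name": 2,
    "route_name": 2, "config_file": 2, "workflow_file": 2,
    "comment": 3, "commit_message": 3, "dependency_name": 3,
    "directory_name": 3,
}

Signal = Tuple[str, str]  # (kind, text)


def rank_signals(signals: List[Signal]) -> List[Signal]:
    """Order signals from most to least trusted; unknown kinds go last."""
    return sorted(signals, key=lambda s: SOURCE_PRIORITY.get(s[0], 4))


ranked = rank_signals([
    ("comment", "routes mail"),
    ("readme", "helps route emails"),
    ("route_name", "/classify-email"),
])
```

Because `sorted` is stable, signals within the same tier keep their original order.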
4. Ability Extraction Heuristics
Abilities describe why the repository is useful.
4.1 Ability Signal Patterns
Look for phrases like:
"helps users..."
"enables..."
"automates..."
"provides a way to..."
"used for..."
"designed to..."
"allows..."
"supports..."
Example:
"This library helps route incoming business emails."
Candidate ability:
name: Business Email Routing
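A rough sketch of how the phrase patterns above could be matched against README prose. The regular expression and the `find_ability_sentences` helper are assumptions made for illustration; in particular, "helps users" is widened to "helps" so that the example sentence matches. Turning a matched sentence into a name such as Business Email Routing is left to a later step or to review.

```python
import re
from typing import List, Tuple

# Trigger phrases from the list above ("helps users" widened to "helps").
ABILITY_PATTERN = re.compile(
    r"\b(helps|enables|automates|provides a way to|used for|"
    r"designed to|allows|supports)\b",
    re.IGNORECASE,
)


def find_ability_sentences(text: str) -> List[Tuple[str, str]]:
    """Return (trigger phrase, sentence) pairs for sentences that match."""
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    hits = []
    for sentence in sentences:
        match = ABILITY_PATTERN.search(sentence)
        if match:
            hits.append((match.group(1).lower(), sentence))
    return hits


readme = "This library helps route incoming business emails. It is written in Python."
hits = find_ability_sentences(readme)
```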
4.2 Ability Naming Rule
Ability names should be:
Domain + Problem Class
Good:
Business Email Routing
Document Classification
Invoice Data Extraction
Kubernetes Deployment Inspection
Agent Workflow Orchestration
Bad:
Fast API
Email Button
Classifier
Uses GPT
4.3 Ability Extraction Sources
Best sources for abilities:
README intro
project tagline
docs overview
examples index
package description
Ability is usually described in prose, not code.
4.4 Ability Confidence
Assign confidence based on signal quality:
confidence:
  high:
    - explicitly stated in README/docs
    - supported by examples
    - supported by tests or APIs
  medium:
    - inferred from multiple capabilities/features
    - visible in examples but not stated
  low:
    - inferred from names only
    - based on dependencies or folder structure
5. Capability Extraction Heuristics
Capabilities describe bounded behavior.
5.1 Capability Signal Patterns
Look for verbs applied to objects:
classify email
extract invoice data
summarize document
validate schema
generate response
deploy service
monitor cluster
route ticket
convert format
Pattern:
Verb + Object
Examples:
Classify Email Intent
Extract Invoice Metadata
Generate Routing Explanation
Validate Repository Metadata
5.2 Capability Naming Rule
Capability names should be:
Action Verb + Domain Object
Good:
Classify Incoming Email
Extract PDF Metadata
Generate API Client
Validate Kubernetes Manifest
Detect Broken Links
Bad:
Email Capability
Parser
Smart Document Stuff
Endpoint
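The naming rule can be approximated with a cheap shape check. This is only a sketch: the verb list is taken from the examples in this section and is not exhaustive, and `looks_like_capability_name` is a hypothetical helper name.

```python
# Action verbs seen in the examples above; an illustrative, non-exhaustive set.
ACTION_VERBS = {
    "classify", "extract", "summarize", "validate", "generate",
    "deploy", "monitor", "route", "convert", "detect",
}


def looks_like_capability_name(name: str) -> bool:
    """True if the name has the shape 'Action Verb + Domain Object'."""
    words = name.split()
    return len(words) >= 2 and words[0].lower() in ACTION_VERBS


good = ["Classify Incoming Email", "Extract PDF Metadata", "Detect Broken Links"]
bad = ["Email Capability", "Parser", "Smart Document Stuff", "Endpoint"]
```

A check like this only rejects obviously malformed names; it says nothing about whether the capability is correctly bounded.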
5.3 Capability Sources
Best sources:
API route names
CLI commands
public functions
service classes
tests
examples
docs tutorials
Capability is often visible in code and tests.
5.4 Capability Boundary Rule
A capability should be small enough to test.
Good:
Extract invoice date from PDF
Classify email into intent category
Generate markdown from DOCX
Too broad:
Manage documents
Automate business
Understand everything
Too narrow:
Read config variable
Call helper function
Trim whitespace
Rule of thumb:
If you can write a meaningful acceptance test for it, it is probably a capability.
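To make the rule of thumb concrete, here is what an acceptance test for a capability like "Extract invoice date" might look like. Everything in it is hypothetical: the `extract_invoice_date` function, the "Invoice date:" label, and the assumption that the PDF text has already been extracted to a string.

```python
import re
from datetime import date
from typing import Optional


def extract_invoice_date(text: str) -> Optional[date]:
    """Hypothetical capability: read the ISO date after 'Invoice date:'."""
    match = re.search(r"Invoice date:\s*(\d{4})-(\d{2})-(\d{2})", text)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


def test_extract_invoice_date() -> None:
    # Input/output expectations are what make this a testable capability.
    sample = "Invoice no: 42\nInvoice date: 2024-03-15\nTotal: 10.00"
    assert extract_invoice_date(sample) == date(2024, 3, 15)
    assert extract_invoice_date("no date here") is None


test_extract_invoice_date()
```

By contrast, "Manage documents" has no single input/output pair to assert on, and "Trim whitespace" has one but no user would ask for it.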
6. Feature Extraction Heuristics
Features describe how the capability is exposed or implemented.
6.1 Feature Signal Patterns
Look for concrete affordances:
REST endpoint
CLI command
UI component
configuration option
SDK method
background job
database migration
import/export format
plugin hook
Examples:
features:
  - name: /classify-email endpoint
  - name: classify-email CLI command
  - name: department-rules.yaml config
  - name: JSON result export
6.2 Feature Naming Rule
Feature names should be concrete and inspectable.
Good:
POST /api/classify-email
classify-email CLI command
Rule Configuration File
PDF Upload Component
Bad:
AI routing
Document understanding
Magic extraction
7. Evidence Extraction Heuristics
Evidence supports claims.
7.1 Evidence Types
evidence_types:
  - unit_test
  - integration_test
  - example
  - demo
  - benchmark
  - documentation
  - API specification
  - production usage note
  - manual review
7.2 Evidence Mapping
Map evidence to the nearest capability.
Example:
tests/test_email_classifier.py
Supports:
Classify Incoming Email
Example:
examples/invoice_extraction_demo.py
Supports:
Extract Invoice Metadata
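One simple way to pick the "nearest" capability is token overlap between the file name and the capability name. The sketch below assumes that approach; the noise-word list, the five-character shared-prefix rule (so that "classifier" matches "classify"), and the helper names are all illustrative choices, not part of the heuristics themselves.

```python
import os
import re
from pathlib import PurePosixPath
from typing import List, Optional

# File-name tokens that carry no domain meaning (illustrative list).
NOISE = {"test", "tests", "demo", "example", "examples"}


def _tokens(text: str) -> List[str]:
    parts = re.split(r"[^a-z0-9]+", text.lower())
    return [p for p in parts if p and p not in NOISE]


def _similar(a: str, b: str, min_prefix: int = 5) -> bool:
    # Exact match, or a long shared prefix ("classifier" vs "classify").
    return a == b or len(os.path.commonprefix([a, b])) >= min_prefix


def nearest_capability(path: str, capabilities: List[str]) -> Optional[str]:
    """Return the capability whose name best overlaps the file name."""
    path_tokens = _tokens(PurePosixPath(path).stem)
    best, best_score = None, 0
    for capability in capabilities:
        cap_tokens = _tokens(capability)
        score = sum(
            any(_similar(p, c) for c in cap_tokens) for p in path_tokens
        )
        if score > best_score:
            best, best_score = capability, score
    return best  # None when nothing overlaps


capabilities = ["Classify Incoming Email", "Extract Invoice Metadata"]
```

Returning `None` for files with no overlap keeps the mapping conservative: unmatched evidence is left for review rather than forced onto a capability.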
7.3 Evidence Strength
evidence_strength:
  strong:
    - automated tests
    - benchmark results
    - executable examples
    - integration tests
  medium:
    - documentation
    - tutorials
    - screenshots
    - sample output
  weak:
    - README claim only
    - comments
    - filename hints
8. Ability–Capability–Feature Linking
8.1 Link Rule
Ability explains why.
Capability explains what.
Feature explains how/where.
Example:
ability:
  name: Business Email Routing
capability:
  name: Classify Incoming Email
  supports:
    - Business Email Routing
feature:
  name: POST /api/classify-email
  implements:
    - Classify Incoming Email
8.2 Linking Heuristic
A capability supports an ability if:
Removing the capability would weaken the repository’s ability to deliver that usefulness.
A feature implements a capability if:
The feature is an interface, component, or code location through which the behavior is performed or exposed.
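The two link rules cannot be decided mechanically, but the resulting links can be checked for consistency. The following sketch assumes candidates are held as plain dicts with the `supports` and `implements` keys used in the example above; `find_link_problems` is a hypothetical helper.

```python
from typing import Dict, List


def find_link_problems(
    abilities: List[str],
    capabilities: List[Dict],
    features: List[Dict],
) -> List[str]:
    """Report dangling or missing links between the three layers."""
    problems = []
    ability_names = set(abilities)
    capability_names = {c["name"] for c in capabilities}

    for cap in capabilities:
        refs = cap.get("supports", [])
        if not refs:
            problems.append(f"capability {cap['name']!r} supports no ability")
        for ref in refs:
            if ref not in ability_names:
                problems.append(
                    f"capability {cap['name']!r} supports unknown ability {ref!r}"
                )

    for feat in features:
        refs = feat.get("implements", [])
        if not refs:
            problems.append(f"feature {feat['name']!r} implements no capability")
        for ref in refs:
            if ref not in capability_names:
                problems.append(
                    f"feature {feat['name']!r} implements unknown capability {ref!r}"
                )
    return problems


abilities = ["Business Email Routing"]
capabilities = [
    {"name": "Classify Incoming Email", "supports": ["Business Email Routing"]}
]
features = [
    {"name": "POST /api/classify-email", "implements": ["Classify Incoming Email"]}
]
problems = find_link_problems(abilities, capabilities, features)
```

An empty result means every capability and feature is anchored; it does not mean the links are semantically right, which is still a review question.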
9. Confidence Scoring
Use a simple additive model first.
9.1 Candidate Confidence Factors
confidence_factors:
  explicit_doc_claim: +0.30
  example_present: +0.20
  test_present: +0.25
  implementation_location_found: +0.15
  api_or_cli_exposed: +0.15
  multiple_source_agreement: +0.20
  inferred_from_names_only: -0.25
  no_evidence: -0.30
Sum the factors that apply, then normalize the result (for example, by clamping) to:
0.0 – 1.0
9.2 Confidence Labels
0.80 - 1.00: high
0.50 - 0.79: medium
0.20 - 0.49: low
0.00 - 0.19: speculative
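A minimal sketch of the additive model and the label bands. It assumes that normalization means clamping the summed score to the 0.0 to 1.0 range, which is one possible reading; `score_candidate` and `confidence_label` are hypothetical names.

```python
from typing import Iterable

# Weights copied from the confidence factors listed in 9.1.
CONFIDENCE_FACTORS = {
    "explicit_doc_claim": 0.30,
    "example_present": 0.20,
    "test_present": 0.25,
    "implementation_location_found": 0.15,
    "api_or_cli_exposed": 0.15,
    "multiple_source_agreement": 0.20,
    "inferred_from_names_only": -0.25,
    "no_evidence": -0.30,
}


def score_candidate(factors: Iterable[str]) -> float:
    """Sum the applicable factors and clamp to [0.0, 1.0] (an assumption)."""
    raw = sum(CONFIDENCE_FACTORS[name] for name in factors)
    return round(min(1.0, max(0.0, raw)), 2)


def confidence_label(score: float) -> str:
    """Map a normalized score to the label bands in 9.2."""
    if score >= 0.80:
        return "high"
    if score >= 0.50:
        return "medium"
    if score >= 0.20:
        return "low"
    return "speculative"


score = score_candidate(["explicit_doc_claim", "test_present", "api_or_cli_exposed"])
label = confidence_label(score)
```

Here a documented, tested, API-exposed candidate scores 0.30 + 0.25 + 0.15 = 0.70, which lands in the medium band. An unknown factor name raises `KeyError`, which keeps typos from silently scoring zero.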
10. Classification Rules
10.1 Is it an Ability?
Ask:
Would a user search for this as a desired outcome?
If yes, probably ability.
Example:
“I need document classification.”
Ability.
10.2 Is it a Capability?
Ask:
Can this behavior be tested with input/output expectations?
If yes, probably capability.
Example:
“Classify document into category.”
Capability.
10.3 Is it a Feature?
Ask:
Is this a concrete interface, option, component, or implementation artifact?
If yes, probably feature.
Example:
“POST /api/classify-document”
Feature.
11. Anti-Heuristics
Things the extractor should avoid.
11.1 Do Not Treat Dependencies as Capabilities
Bad:
capability: Uses OpenAI
Better:
feature: OpenAI provider integration
capability: Generate Text Summary
11.2 Do Not Treat Technology as Ability
Bad:
ability: FastAPI
Better:
feature: FastAPI REST interface
11.3 Do Not Treat Internal Helpers as Capabilities
Bad:
capability: Parse YAML Config
Unless parsing YAML config is a user-visible behavior.
11.4 Avoid Vendor-Hype Terms
Bad:
intelligent automation
next-gen AI
enterprise-ready transformation
Convert into testable candidates:
Classify Documents
Generate Reports
Route Tasks
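Hype terms can be flagged before a candidate reaches review. The sketch below only detects them; rewriting a flagged phrase into a testable candidate is assumed to stay a human or agent step. The term list holds just the three examples above, and `find_hype_terms` is a hypothetical helper.

```python
from typing import List

# The three examples above, lowercased; a real list would be longer.
HYPE_TERMS = (
    "intelligent automation",
    "next-gen ai",
    "enterprise-ready transformation",
)


def find_hype_terms(text: str) -> List[str]:
    """Return the hype terms that occur in the text (case-insensitive)."""
    lowered = text.lower()
    return [term for term in HYPE_TERMS if term in lowered]


flagged = find_hype_terms("Next-Gen AI platform for intelligent automation")
clean = find_hype_terms("Classify Documents")
```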
12. Extraction Pipeline v0.1
Step 1 — Repository Intake
Collect:
README
docs
examples
tests
package files
source tree
API routes
CLI definitions
Step 2 — Structural Summary
Produce:
repository_summary:
  languages: []
  frameworks: []
  interfaces: []
  docs_found: []
  tests_found: []
  examples_found: []
Step 3 — Candidate Ability Extraction
From README/docs/package descriptions.
Output:
candidate_abilities:
  - name
  - description
  - confidence
  - supporting_sources
Step 4 — Candidate Capability Extraction
From APIs, tests, examples, public modules.
Output:
candidate_capabilities:
  - name
  - description
  - inputs
  - outputs
  - linked_abilities
  - confidence
  - supporting_sources
Step 5 — Candidate Feature Extraction
From endpoints, CLI commands, config files, UI components, modules.
Output:
candidate_features:
  - name
  - type
  - location
  - linked_capabilities
  - confidence
Step 6 — Evidence Linking
Attach evidence:
evidence:
  - type
  - path
  - supports
  - strength
Step 7 — Review Package
Generate a curator-friendly review view:
Ability
Capability
Feature
Evidence
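One possible shape for the review view is an indented outline that nests each capability under its ability, with features and evidence beneath. This is a sketch under assumptions: records are plain dicts using the `id`, `ability_refs`, `capability_refs`, and `supports` keys, and `render_review` is a hypothetical helper.

```python
from typing import Dict, List


def render_review(
    abilities: List[Dict],
    capabilities: List[Dict],
    features: List[Dict],
    evidence: List[Dict],
) -> str:
    """Render candidates as an indented Ability > Capability > Feature/Evidence outline."""
    lines = []
    for ability in abilities:
        lines.append(f"Ability: {ability['name']} (confidence {ability['confidence']})")
        for cap in capabilities:
            if ability["id"] not in cap["ability_refs"]:
                continue
            lines.append(f"  Capability: {cap['name']} (confidence {cap['confidence']})")
            for feat in features:
                if cap["id"] in feat["capability_refs"]:
                    lines.append(f"    Feature: {feat['name']} [{feat['location']}]")
            for item in evidence:
                if cap["id"] in item["supports"]:
                    lines.append(
                        f"    Evidence: {item['path']} ({item['type']}, {item['strength']})"
                    )
    return "\n".join(lines)


view = render_review(
    abilities=[{"id": "ability.business_email_routing",
                "name": "Business Email Routing", "confidence": 0.9}],
    capabilities=[{"id": "capability.classify_incoming_email",
                   "name": "Classify Incoming Email",
                   "ability_refs": ["ability.business_email_routing"],
                   "confidence": 0.85}],
    features=[{"name": "POST /api/classify-email",
               "location": "src/routes/classify_email.py",
               "capability_refs": ["capability.classify_incoming_email"]}],
    evidence=[{"type": "unit_test",
               "path": "tests/test_email_classification.py",
               "supports": ["capability.classify_incoming_email"],
               "strength": "strong"}],
)
```

Keeping the source path on every feature and evidence line lets a curator jump straight to the file that motivated the proposal.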
13. Example Extraction
Given README:
MailRouter helps companies automatically classify incoming emails and route them to the right department.
Given route:
POST /api/classify-email
Given test:
tests/test_email_classification.py
Output:
abilities:
  - id: ability.business_email_routing
    name: Business Email Routing
    confidence: 0.9
capabilities:
  - id: capability.classify_incoming_email
    name: Classify Incoming Email
    ability_refs:
      - ability.business_email_routing
    confidence: 0.85
features:
  - id: feature.classify_email_endpoint
    name: POST /api/classify-email
    type: REST endpoint
    location: src/routes/classify_email.py
    capability_refs:
      - capability.classify_incoming_email
evidence:
  - type: unit_test
    path: tests/test_email_classification.py
    supports:
      - capability.classify_incoming_email
    strength: strong
14. MVP Principle
The extractor should be:
conservative
explainable
reviewable
source-linked
Not magical.
The best first version is not the one that extracts everything.
It is the one where the user says:
“Yes, I understand why the system proposed this.”