AbilityExtractionHeuristics
*How repositories will be explored*
# Ability / Capability Extraction Heuristics v0.1
## Repository Scoping
# 1. Purpose
The extraction engine should answer:
> “What is this repository useful for, what bounded behaviors does it provide, and where are those behaviors implemented?”
It should produce **candidate entries**, not final truth. Human/agent review remains part of the workflow.
---
# 2. Extraction Layers
```text
Ability → usefulness / problem class
Capability → bounded behavior
Feature → concrete interface or implementation
Evidence → reason to believe the claim
```
---
# 3. Source Priority
Not all repository signals are equally trustworthy.
## Priority 1 — High Trust
Use these first:
```text
README
docs/
examples/
tests/
API specs
CLI help
package metadata
```
These usually express intended usage.
## Priority 2 — Medium Trust
```text
module names
function names
class names
route names
config files
workflow files
```
These show implemented structure.
## Priority 3 — Low Trust
```text
comments
commit messages
dependency names
directory names alone
```
Useful as supporting signals, but not enough by themselves.
---
# 4. Ability Extraction Heuristics
Abilities describe **why the repository is useful**.
## 4.1 Ability Signal Patterns
Look for phrases like:
```text
"helps users..."
"enables..."
"automates..."
"provides a way to..."
"used for..."
"designed to..."
"allows..."
"supports..."
```
Example:
```text
"This library helps route incoming business emails."
```
Candidate ability:
```yaml
name: Business Email Routing
```
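A minimal sketch of how the phrase list above could be applied to README prose. The function name `find_ability_sentences`, the naive sentence splitter, and the loosening of "helps users" to plain "helps" (so that the sample sentence matches) are illustrative assumptions, not part of the specification.

```python
import re

# Signal phrases from the list above. "helps users" is loosened to "helps"
# (assumption) so that sentences like "helps route ..." are also caught.
SIGNAL_PATTERN = re.compile(
    r"\b(helps(?: users)?|enables|automates|provides a way to|"
    r"used for|designed to|allows|supports)\b",
    re.IGNORECASE,
)


def find_ability_sentences(text):
    """Return (signal phrase, sentence) pairs for sentences containing a signal."""
    hits = []
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        match = SIGNAL_PATTERN.search(sentence)
        if match:
            hits.append((match.group(1).lower(), sentence))
    return hits


readme = "This library helps route incoming business emails. It is written in Go."
for phrase, sentence in find_ability_sentences(readme):
    print(f"[{phrase}] {sentence}")
```

Each hit is only a pointer to a sentence worth reviewing; turning it into a name such as `Business Email Routing` still needs a human or agent.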
---
## 4.2 Ability Naming Rule
Ability names should be:
```text
Domain + Problem Class
```
Good:
```text
Business Email Routing
Document Classification
Invoice Data Extraction
Kubernetes Deployment Inspection
Agent Workflow Orchestration
```
Bad:
```text
Fast API
Email Button
Classifier
Uses GPT
```
---
## 4.3 Ability Extraction Sources
Best sources for abilities:
```text
README intro
project tagline
docs overview
examples index
package description
```
Abilities are usually described in prose, not code.
---
## 4.4 Ability Confidence
Assign confidence based on signal quality:
```yaml
confidence:
  high:
    - explicitly stated in README/docs
    - supported by examples
    - supported by tests or APIs
  medium:
    - inferred from multiple capabilities/features
    - visible in examples but not stated
  low:
    - inferred from names only
    - based on dependencies or folder structure
```
---
# 5. Capability Extraction Heuristics
Capabilities describe **bounded behavior**.
## 5.1 Capability Signal Patterns
Look for verbs applied to objects:
```text
classify email
extract invoice data
summarize document
validate schema
generate response
deploy service
monitor cluster
route ticket
convert format
```
Pattern:
```text
Verb + Object
```
Examples:
```text
Classify Email Intent
Extract Invoice Metadata
Generate Routing Explanation
Validate Repository Metadata
```
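The `Verb + Object` pattern can be checked mechanically against public identifiers. The sketch below is one possible approach: `identifier_to_capability_name`, `starts_with_action_verb`, and the `KNOWN_VERBS` set (seeded from the verbs used in this document) are hypothetical helpers, not a fixed part of the extractor.

```python
import re

# Verbs that appear in this document's examples; a real list would be longer.
KNOWN_VERBS = {
    "classify", "extract", "summarize", "validate", "generate",
    "deploy", "monitor", "route", "convert", "detect",
}


def identifier_to_capability_name(identifier):
    """Turn snake_case, kebab-case, or camelCase into a Title Case name."""
    # Insert a space at lower-to-upper camelCase boundaries.
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", identifier)
    words = [w for w in re.split(r"[\s_\-]+", spaced) if w]
    return " ".join(w.capitalize() for w in words)


def starts_with_action_verb(name):
    """Check whether the first word of a candidate name is a known action verb."""
    words = name.split()
    return bool(words) and words[0].lower() in KNOWN_VERBS


for ident in ("classify_email", "extractInvoiceMetadata", "email_capability"):
    name = identifier_to_capability_name(ident)
    print(name, starts_with_action_verb(name))
```

Names that fail the verb check are not necessarily wrong, but they are good candidates for manual renaming.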
---
## 5.2 Capability Naming Rule
Capability names should be:
```text
Action Verb + Domain Object
```
Good:
```text
Classify Incoming Email
Extract PDF Metadata
Generate API Client
Validate Kubernetes Manifest
Detect Broken Links
```
Bad:
```text
Email Capability
Parser
Smart Document Stuff
Endpoint
```
---
## 5.3 Capability Sources
Best sources:
```text
API route names
CLI commands
public functions
service classes
tests
examples
docs tutorials
```
Capability is often visible in code and tests.
---
## 5.4 Capability Boundary Rule
A capability should be small enough to test.
Good:
```text
Extract invoice date from PDF
Classify email into intent category
Generate markdown from DOCX
```
Too broad:
```text
Manage documents
Automate business
Understand everything
```
Too narrow:
```text
Read config variable
Call helper function
Trim whitespace
```
Rule of thumb:
> If you can write a meaningful acceptance test for it, it is probably a capability.
---
# 6. Feature Extraction Heuristics
Features describe **how the capability is exposed or implemented**.
## 6.1 Feature Signal Patterns
Look for concrete affordances:
```text
REST endpoint
CLI command
UI component
configuration option
SDK method
background job
database migration
import/export format
plugin hook
```
Examples:
```yaml
features:
- name: /classify-email endpoint
- name: classify-email CLI command
- name: department-rules.yaml config
- name: JSON result export
```
---
## 6.2 Feature Naming Rule
Feature names should be concrete and inspectable.
Good:
```text
POST /api/classify-email
classify-email CLI command
Rule Configuration File
PDF Upload Component
```
Bad:
```text
AI routing
Document understanding
Magic extraction
```
---
# 7. Evidence Extraction Heuristics
Evidence supports claims.
## 7.1 Evidence Types
```yaml
evidence_types:
  - unit_test
  - integration_test
  - example
  - demo
  - benchmark
  - documentation
  - API specification
  - production usage note
  - manual review
```
---
## 7.2 Evidence Mapping
Map evidence to the nearest capability.
Example:
```text
tests/test_email_classifier.py
```
Supports:
```text
Classify Incoming Email
```
Example:
```text
examples/invoice_extraction_demo.py
```
Supports:
```text
Extract Invoice Metadata
```
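One way to approximate "nearest capability" is token overlap between the evidence file name and each capability name. This is a rough sketch under stated assumptions: the five-character prefix used as a crude stem, the `STOP_TOKENS` set, and the function name `nearest_capability` are all illustrative choices.

```python
from __future__ import annotations

import re
from pathlib import PurePosixPath

# Tokens that say nothing about the behavior under test (assumption).
STOP_TOKENS = {"test", "tests", "demo", "example", "examples"}


def _keys(text: str) -> set[str]:
    """Lowercase tokens truncated to 5 characters as a crude stem."""
    parts = re.split(r"[^a-z0-9]+", text.lower())
    return {p[:5] for p in parts if p and p not in STOP_TOKENS}


def nearest_capability(path: str, capabilities: list[str]) -> str | None:
    """Return the capability sharing the most stemmed tokens with the file name."""
    path_keys = _keys(PurePosixPath(path).stem)
    best, best_score = None, 0
    for capability in capabilities:
        score = len(path_keys & _keys(capability))
        if score > best_score:
            best, best_score = capability, score
    return best  # None when nothing overlaps


capabilities = ["Classify Incoming Email", "Extract Invoice Metadata"]
print(nearest_capability("tests/test_email_classifier.py", capabilities))
print(nearest_capability("examples/invoice_extraction_demo.py", capabilities))
```

A file with no overlapping tokens returns `None`, which keeps the mapping conservative: unmatched evidence is left for review rather than forced onto a capability.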
---
## 7.3 Evidence Strength
```yaml
evidence_strength:
  strong:
    - automated tests
    - benchmark results
    - executable examples
    - integration tests
  medium:
    - documentation
    - tutorials
    - screenshots
    - sample output
  weak:
    - README claim only
    - comments
    - filename hints
```
---
# 8. Ability-Capability-Feature Linking
## 8.1 Link Rule
```text
Ability explains why.
Capability explains what.
Feature explains how/where.
```
Example:
```yaml
ability:
  name: Business Email Routing
capability:
  name: Classify Incoming Email
  supports:
    - Business Email Routing
feature:
  name: POST /api/classify-email
  implements:
    - Classify Incoming Email
```
---
## 8.2 Linking Heuristic
A capability supports an ability if:
```text
Removing the capability would weaken the repository's ability to deliver that usefulness.
```
A feature implements a capability if:
```text
The feature is an interface, component, or code location through which the behavior is performed or exposed.
```
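The link rules imply a simple integrity check: every `supports` entry should name a known ability, and every `implements` entry a known capability. The sketch below assumes hypothetical `Capability` and `Feature` dataclasses and a `find_dangling_links` helper; it only validates references and does not decide whether a link is semantically justified.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Capability:
    name: str
    supports: list[str] = field(default_factory=list)  # ability names


@dataclass
class Feature:
    name: str
    implements: list[str] = field(default_factory=list)  # capability names


def find_dangling_links(abilities, capabilities, features):
    """Report links that point at abilities or capabilities that do not exist."""
    problems = []
    ability_names = set(abilities)
    capability_names = {c.name for c in capabilities}
    for cap in capabilities:
        for target in cap.supports:
            if target not in ability_names:
                problems.append(f"capability '{cap.name}' supports unknown ability '{target}'")
    for feat in features:
        for target in feat.implements:
            if target not in capability_names:
                problems.append(f"feature '{feat.name}' implements unknown capability '{target}'")
    return problems


abilities = ["Business Email Routing"]
capabilities = [Capability("Classify Incoming Email", ["Business Email Routing"])]
features = [Feature("POST /api/classify-email", ["Classify Incoming Email"])]
print(find_dangling_links(abilities, capabilities, features))
```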
---
# 9. Confidence Scoring
Use a simple additive model first.
## 9.1 Candidate Confidence Factors
```yaml
confidence_factors:
  explicit_doc_claim: +0.30
  example_present: +0.20
  test_present: +0.25
  implementation_location_found: +0.15
  api_or_cli_exposed: +0.15
  multiple_source_agreement: +0.20
  inferred_from_names_only: -0.25
  no_evidence: -0.30
```
Normalize the summed score to the range:
```text
0.0 - 1.0
```
## 9.2 Confidence Labels
```yaml
0.80 - 1.00: high
0.50 - 0.79: medium
0.20 - 0.49: low
0.00 - 0.19: speculative
```
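A minimal sketch of the additive model, using the factor weights and label bands from this section. Interpreting "normalize" as clamping the sum into the 0.0 to 1.0 range is an assumption, as are the function names `score_candidate` and `confidence_label`.

```python
# Factor weights copied from section 9.1.
CONFIDENCE_FACTORS = {
    "explicit_doc_claim": 0.30,
    "example_present": 0.20,
    "test_present": 0.25,
    "implementation_location_found": 0.15,
    "api_or_cli_exposed": 0.15,
    "multiple_source_agreement": 0.20,
    "inferred_from_names_only": -0.25,
    "no_evidence": -0.30,
}


def score_candidate(signals):
    """Sum the weights of the observed signals and clamp to [0.0, 1.0]."""
    raw = sum(CONFIDENCE_FACTORS[s] for s in signals)  # KeyError on unknown signal
    return round(min(1.0, max(0.0, raw)), 2)


def confidence_label(score):
    """Map a clamped score to the labels from section 9.2."""
    if score >= 0.80:
        return "high"
    if score >= 0.50:
        return "medium"
    if score >= 0.20:
        return "low"
    return "speculative"


signals = ["explicit_doc_claim", "test_present", "api_or_cli_exposed"]
score = score_candidate(signals)
print(score, confidence_label(score))
```

Because the positive weights sum to more than 1.0, clamping means several strong signals saturate at the top of the range; a later version might rescale instead.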
---
# 10. Classification Rules
## 10.1 Is it an Ability?
Ask:
```text
Would a user search for this as a desired outcome?
```
If yes, probably ability.
Example:
```text
“I need document classification.”
```
Ability.
---
## 10.2 Is it a Capability?
Ask:
```text
Can this behavior be tested with input/output expectations?
```
If yes, probably capability.
Example:
```text
“Classify document into category.”
```
Capability.
---
## 10.3 Is it a Feature?
Ask:
```text
Is this a concrete interface, option, component, or implementation artifact?
```
If yes, probably feature.
Example:
```text
“POST /api/classify-document”
```
Feature.
---
# 11. Anti-Heuristics
Things the extractor should avoid.
## 11.1 Do Not Treat Dependencies as Capabilities
Bad:
```yaml
capability: Uses OpenAI
```
Better:
```yaml
feature: OpenAI provider integration
capability: Generate Text Summary
```
---
## 11.2 Do Not Treat Technology as Ability
Bad:
```yaml
ability: FastAPI
```
Better:
```yaml
feature: FastAPI REST interface
```
---
## 11.3 Do Not Treat Internal Helpers as Capabilities
Bad:
```yaml
capability: Parse YAML Config
```
This is acceptable only when parsing YAML config is a user-visible behavior.
---
## 11.4 Avoid Vendor-Hype Terms
Bad:
```text
intelligent automation
next-gen AI
enterprise-ready transformation
```
Convert into testable candidates:
```text
Classify Documents
Generate Reports
Route Tasks
```
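Some of the anti-heuristics above can be expressed as a small name linter. The sketch below is illustrative only: `lint_candidate_name`, the warning strings, and the caller-supplied `known_technologies` set are assumptions, and the hype list contains just the three terms from this section.

```python
# Hype terms from section 11.4, lowercased for matching.
HYPE_TERMS = (
    "intelligent automation",
    "next-gen ai",
    "enterprise-ready transformation",
)


def lint_candidate_name(name, known_technologies=frozenset()):
    """Return warnings for names that break the anti-heuristics."""
    warnings = []
    lowered = name.strip().lower()
    if lowered.startswith("uses "):
        # 11.1: "Uses X" names a dependency, not a behavior.
        warnings.append("dependency, not behavior")
    if lowered in {t.lower() for t in known_technologies}:
        # 11.2: a bare technology name is not an ability.
        warnings.append("technology, not ability")
    if any(term in lowered for term in HYPE_TERMS):
        # 11.4: hype terms are not testable.
        warnings.append("hype term")
    return warnings


print(lint_candidate_name("Uses OpenAI"))
print(lint_candidate_name("FastAPI", known_technologies={"FastAPI"}))
print(lint_candidate_name("Classify Documents"))
```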
---
# 12. Extraction Pipeline v0.1
## Step 1 — Repository Intake
Collect:
```text
README
docs
examples
tests
package files
source tree
API routes
CLI definitions
```
---
## Step 2 — Structural Summary
Produce:
```yaml
repository_summary:
  languages: []
  frameworks: []
  interfaces: []
  docs_found: []
  tests_found: []
  examples_found: []
```
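A rough sketch of how part of this summary could be filled from the file tree alone. The extension-to-language table, the top-level directory names (`docs`, `tests`, `examples`), and the function name `summarize_repository` are assumptions; `frameworks` and `interfaces` are left empty because they need deeper parsing than a directory walk.

```python
from pathlib import Path

# Illustrative mapping only; a real table would cover many more extensions.
EXTENSION_LANGUAGES = {".py": "Python", ".ts": "TypeScript", ".go": "Go", ".rs": "Rust"}


def summarize_repository(root):
    """Build a partial repository_summary dict by walking the file tree."""
    root_path = Path(root)
    summary = {
        "languages": [], "frameworks": [], "interfaces": [],
        "docs_found": [], "tests_found": [], "examples_found": [],
    }
    languages = set()
    for path in sorted(root_path.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root_path)
        top = rel.parts[0].lower()
        if path.suffix in EXTENSION_LANGUAGES:
            languages.add(EXTENSION_LANGUAGES[path.suffix])
        if top == "docs" or rel.name.lower().startswith("readme"):
            summary["docs_found"].append(rel.as_posix())
        elif top == "tests":
            summary["tests_found"].append(rel.as_posix())
        elif top == "examples":
            summary["examples_found"].append(rel.as_posix())
    summary["languages"] = sorted(languages)
    return summary


if __name__ == "__main__":
    print(summarize_repository("."))
```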
---
## Step 3 — Candidate Ability Extraction
From README/docs/package descriptions.
Output:
```yaml
candidate_abilities:
- name
- description
- confidence
- supporting_sources
```
---
## Step 4 — Candidate Capability Extraction
From APIs, tests, examples, public modules.
Output:
```yaml
candidate_capabilities:
- name
- description
- inputs
- outputs
- linked_abilities
- confidence
- supporting_sources
```
---
## Step 5 — Candidate Feature Extraction
From endpoints, CLI commands, config files, UI components, modules.
Output:
```yaml
candidate_features:
- name
- type
- location
- linked_capabilities
- confidence
```
---
## Step 6 — Evidence Linking
Attach evidence:
```yaml
evidence:
- type
- path
- supports
- strength
```
---
## Step 7 — Review Package
Generate a curator-friendly review view:
```text
Ability
Capability
Feature
Evidence
```
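The review view could be rendered as an indented outline, one ability at a time. This is a sketch under assumptions: the nesting (features and evidence shown under their capability), the dict shape passed in, and the name `render_review` are all hypothetical.

```python
def render_review(ability, capabilities):
    """Render one ability and its linked candidates as an indented outline."""
    lines = [f"Ability: {ability}"]
    for cap in capabilities:
        lines.append(f"  Capability: {cap['name']}")
        for feature in cap.get("features", []):
            lines.append(f"    Feature: {feature}")
        for evidence in cap.get("evidence", []):
            lines.append(f"    Evidence: {evidence}")
    return "\n".join(lines)


print(render_review(
    "Business Email Routing",
    [{
        "name": "Classify Incoming Email",
        "features": ["POST /api/classify-email"],
        "evidence": ["tests/test_email_classification.py"],
    }],
))
```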
---
# 13. Example Extraction
Given README:
```text
MailRouter helps companies automatically classify incoming emails and route them to the right department.
```
Given route:
```text
POST /api/classify-email
```
Given test:
```text
tests/test_email_classification.py
```
Output:
```yaml
abilities:
  - id: ability.business_email_routing
    name: Business Email Routing
    confidence: 0.9
capabilities:
  - id: capability.classify_incoming_email
    name: Classify Incoming Email
    ability_refs:
      - ability.business_email_routing
    confidence: 0.85
features:
  - id: feature.classify_email_endpoint
    name: POST /api/classify-email
    type: REST endpoint
    location: src/routes/classify_email.py
    capability_refs:
      - capability.classify_incoming_email
evidence:
  - type: unit_test
    path: tests/test_email_classification.py
    supports:
      - capability.classify_incoming_email
    strength: strong
```
---
# 14. MVP Principle
The extractor should be:
```text
conservative
explainable
reviewable
source-linked
```
Not magical.
The best first version is not the one that extracts everything.
It is the one where the user says:
> “Yes, I understand why the system proposed this.”