*How repositories will be explored*

# Ability / Capability Extraction Heuristics v0.1

## Repository Scoping

# 1. Purpose

The extraction engine should answer:

> “What is this repository useful for, what bounded behaviors does it provide, and where are those behaviors implemented?”

It should produce **candidate entries**, not final truth. Human/agent review remains part of the workflow.

---

# 2. Extraction Layers

```text
Ability      → usefulness / problem class
Capability   → bounded behavior
Feature      → concrete interface or implementation
Evidence     → reason to believe the claim
```

---

# 3. Source Priority

Not all repository signals are equally trustworthy.

## Priority 1 — High Trust

Use these first:

```text
README
docs/
examples/
tests/
API specs
CLI help
package metadata
```

These usually express intended usage.

## Priority 2 — Medium Trust

```text
module names
function names
class names
route names
config files
workflow files
```

These show implemented structure.

## Priority 3 — Low Trust

```text
comments
commit messages
dependency names
directory names alone
```

Useful as supporting signals, but not enough by themselves.
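
As a minimal sketch of how these tiers could drive processing order, the snippet below ranks collected signals by trust. The signal kind labels (`readme`, `route_name`, and so on) and the `rank_signals` helper are hypothetical names chosen for illustration; only the three tiers themselves come from the lists above.

```python
# Hypothetical signal-kind labels mapped to the three priority tiers above.
SOURCE_PRIORITY = {
    # Priority 1: high trust
    "readme": 1, "docs": 1, "examples": 1, "tests": 1,
    "api_spec": 1, "cli_help": 1, "package_metadata": 1,
    # Priority 2: medium trust
    "module_name": 2, "function_name": 2, "class_name": 2,
    "route_name": 2, "config_file": 2, "workflow_file": 2,
    # Priority 3: low trust
    "comment": 3, "commit_message": 3, "dependency_name": 3,
    "directory_name": 3,
}


def rank_signals(signals):
    """Sort (kind, text) pairs so higher-trust kinds come first.

    Unknown kinds are treated as low trust (an assumption, to stay
    conservative). The sort is stable, so ties keep their input order.
    """
    return sorted(signals, key=lambda s: SOURCE_PRIORITY.get(s[0], 3))


if __name__ == "__main__":
    collected = [
        ("comment", "# routes mail"),
        ("readme", "MailRouter helps companies route email."),
        ("route_name", "POST /api/classify-email"),
    ]
    for kind, text in rank_signals(collected):
        print(SOURCE_PRIORITY.get(kind, 3), kind, text)
```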

---

# 4. Ability Extraction Heuristics

Abilities describe **why the repository is useful**.

## 4.1 Ability Signal Patterns

Look for phrases like:

```text
"helps users..."
"enables..."
"automates..."
"provides a way to..."
"used for..."
"designed to..."
"allows..."
"supports..."
```

Example:

```text
"This library helps route incoming business emails."
```

Candidate ability:

```yaml
name: Business Email Routing
```
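
One possible way to surface such phrases is a simple regular-expression scan over README sentences. This is a sketch, not a prescribed implementation: the function name `find_ability_claims` is hypothetical, and the pattern loosens "helps users..." to also match a bare "helps" so that the example sentence above is caught. It only returns the raw claim; turning the claim into an ability name is still left to review.

```python
import re

# Signal phrases from the list above; "helps" is matched with or
# without "users" (an assumption made so the example sentence matches).
ABILITY_PATTERN = re.compile(
    r"\b(helps(?: users)?|enables|automates|provides a way to|"
    r"used for|designed to|allows|supports)\b\s+(.+?)[.!?]?$",
    re.IGNORECASE,
)


def find_ability_claims(text):
    """Return (signal_phrase, claim_text) pairs, one per matching sentence."""
    claims = []
    # Naive sentence split on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        match = ABILITY_PATTERN.search(sentence)
        if match:
            claims.append((match.group(1).lower(), match.group(2).strip()))
    return claims


if __name__ == "__main__":
    readme = "This library helps route incoming business emails."
    print(find_ability_claims(readme))
```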

---

## 4.2 Ability Naming Rule

Ability names should be:

```text
Domain + Problem Class
```

Good:

```text
Business Email Routing
Document Classification
Invoice Data Extraction
Kubernetes Deployment Inspection
Agent Workflow Orchestration
```

Bad:

```text
Fast API
Email Button
Classifier
Uses GPT
```

---

## 4.3 Ability Extraction Sources

Best sources for abilities:

```text
README intro
project tagline
docs overview
examples index
package description
```

Abilities are usually described in prose, not code.

---

## 4.4 Ability Confidence

Assign confidence based on signal quality:

```yaml
confidence:
  high:
    - explicitly stated in README/docs
    - supported by examples
    - supported by tests or APIs

  medium:
    - inferred from multiple capabilities/features
    - visible in examples but not stated

  low:
    - inferred from names only
    - based on dependencies or folder structure
```

---

# 5. Capability Extraction Heuristics

Capabilities describe **bounded behavior**.

## 5.1 Capability Signal Patterns

Look for verbs applied to objects:

```text
classify email
extract invoice data
summarize document
validate schema
generate response
deploy service
monitor cluster
route ticket
convert format
```

Pattern:

```text
Verb + Object
```

Examples:

```text
Classify Email Intent
Extract Invoice Metadata
Generate Routing Explanation
Validate Repository Metadata
```

---

## 5.2 Capability Naming Rule

Capability names should be:

```text
Action Verb + Domain Object
```

Good:

```text
Classify Incoming Email
Extract PDF Metadata
Generate API Client
Validate Kubernetes Manifest
Detect Broken Links
```

Bad:

```text
Email Capability
Parser
Smart Document Stuff
Endpoint
```
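
A rough check for this naming rule could look like the sketch below. The verb list is an assumption assembled from the examples in this section, not an exhaustive vocabulary, and `looks_like_capability_name` is a hypothetical helper name.

```python
# Action verbs taken from the examples in this section (not exhaustive).
ACTION_VERBS = {
    "classify", "extract", "summarize", "validate", "generate",
    "deploy", "monitor", "route", "convert", "detect",
}


def looks_like_capability_name(name):
    """True if the name starts with a known action verb and has an object."""
    words = name.split()
    return len(words) >= 2 and words[0].lower() in ACTION_VERBS


if __name__ == "__main__":
    for candidate in ["Classify Incoming Email", "Email Capability", "Parser"]:
        print(candidate, "->", looks_like_capability_name(candidate))
```

A failed check is a prompt for review rather than a rejection, since a valid capability may use a verb that is not in the list.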

---

## 5.3 Capability Sources

Best sources:

```text
API route names
CLI commands
public functions
service classes
tests
examples
docs tutorials
```

Capabilities are often visible in code and tests.

---

## 5.4 Capability Boundary Rule

A capability should be small enough to test.

Good:

```text
Extract invoice date from PDF
Classify email into intent category
Generate markdown from DOCX
```

Too broad:

```text
Manage documents
Automate business
Understand everything
```

Too narrow:

```text
Read config variable
Call helper function
Trim whitespace
```

Rule of thumb:

> If you can write a meaningful acceptance test for it, it is probably a capability.

---

# 6. Feature Extraction Heuristics

Features describe **how the capability is exposed or implemented**.

## 6.1 Feature Signal Patterns

Look for concrete affordances:

```text
REST endpoint
CLI command
UI component
configuration option
SDK method
background job
database migration
import/export format
plugin hook
```

Examples:

```yaml
features:
  - name: /classify-email endpoint
  - name: classify-email CLI command
  - name: department-rules.yaml config
  - name: JSON result export
```

---

## 6.2 Feature Naming Rule

Feature names should be concrete and inspectable.

Good:

```text
POST /api/classify-email
classify-email CLI command
Rule Configuration File
PDF Upload Component
```

Bad:

```text
AI routing
Document understanding
Magic extraction
```

---

# 7. Evidence Extraction Heuristics

Evidence supports claims.

## 7.1 Evidence Types

```yaml
evidence_types:
  - unit_test
  - integration_test
  - example
  - demo
  - benchmark
  - documentation
  - api_specification
  - production_usage_note
  - manual_review
```

---

## 7.2 Evidence Mapping

Map evidence to the nearest capability.

Example:

```text
tests/test_email_classifier.py
```

Supports:

```text
Classify Incoming Email
```

Example:

```text
examples/invoice_extraction_demo.py
```

Supports:

```text
Extract Invoice Metadata
```
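
"Nearest" is not defined above, so the following is only one possible reading: match a file path to the capability whose name shares the most word stems with it. The 5-character truncation is a crude stand-in for real stemming, and the helper names and stop-token list are hypothetical.

```python
import re

# Path tokens that carry no capability meaning (an assumed list).
STOP_TOKENS = {"test", "tests", "example", "examples", "demo", "py"}


def _stems(text):
    """Lowercase alphabetic tokens cut to 5 characters (crude stemming)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {t[:5] for t in tokens if t not in STOP_TOKENS}


def nearest_capability(path, capability_names):
    """Return the capability sharing the most stems with the path, or None."""
    path_stems = _stems(path)
    best, best_score = None, 0
    for name in capability_names:
        score = len(path_stems & _stems(name))
        if score > best_score:
            best, best_score = name, score
    return best


if __name__ == "__main__":
    capabilities = ["Classify Incoming Email", "Extract Invoice Metadata"]
    for path in ["tests/test_email_classifier.py",
                 "examples/invoice_extraction_demo.py"]:
        print(path, "->", nearest_capability(path, capabilities))
```

A match found this way is itself a filename hint, so it is best treated as a proposed link for review rather than a confirmed one.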

---

## 7.3 Evidence Strength

```yaml
evidence_strength:
  strong:
    - automated tests
    - benchmark results
    - executable examples
    - integration tests

  medium:
    - documentation
    - tutorials
    - screenshots
    - sample output

  weak:
    - README claim only
    - comments
    - filename hints
```

---

# 8. Ability–Capability–Feature Linking

## 8.1 Link Rule

```text
Ability explains why.
Capability explains what.
Feature explains how/where.
```

Example:

```yaml
ability:
  name: Business Email Routing

capability:
  name: Classify Incoming Email
  supports:
    - Business Email Routing

feature:
  name: POST /api/classify-email
  implements:
    - Classify Incoming Email
```

---

## 8.2 Linking Heuristic

A capability supports an ability if:

```text
Removing the capability would weaken the repository’s ability to deliver that usefulness.
```

A feature implements a capability if:

```text
The feature is an interface, component, or code location through which the behavior is performed or exposed.
```

---

# 9. Confidence Scoring

Use a simple additive model first.

## 9.1 Candidate Confidence Factors

```yaml
confidence_factors:
  explicit_doc_claim: +0.30
  example_present: +0.20
  test_present: +0.25
  implementation_location_found: +0.15
  api_or_cli_exposed: +0.15
  multiple_source_agreement: +0.20
  inferred_from_names_only: -0.25
  no_evidence: -0.30
```

Sum the factors that apply, then normalize the total (for example, by clamping) to:

```text
0.0 – 1.0
```

## 9.2 Confidence Labels

```yaml
0.80 - 1.00: high
0.50 - 0.79: medium
0.20 - 0.49: low
0.00 - 0.19: speculative
```
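
The additive model can be sketched as follows. The factor weights and label thresholds come from the tables above; treating normalization as clamping is an assumption, since the positive factors alone can sum to 1.25. The function names are hypothetical.

```python
CONFIDENCE_FACTORS = {
    "explicit_doc_claim": 0.30,
    "example_present": 0.20,
    "test_present": 0.25,
    "implementation_location_found": 0.15,
    "api_or_cli_exposed": 0.15,
    "multiple_source_agreement": 0.20,
    "inferred_from_names_only": -0.25,
    "no_evidence": -0.30,
}


def score_candidate(factors):
    """Sum the weights of the observed factors and clamp to [0.0, 1.0].

    Clamping is an assumed normalization; the result is rounded to two
    decimals to avoid floating-point noise.
    """
    raw = sum(CONFIDENCE_FACTORS[name] for name in factors)
    return round(min(1.0, max(0.0, raw)), 2)


def confidence_label(score):
    """Map a normalized score to the labels in 9.2 using lower bounds."""
    if score >= 0.80:
        return "high"
    if score >= 0.50:
        return "medium"
    if score >= 0.20:
        return "low"
    return "speculative"


if __name__ == "__main__":
    observed = ["explicit_doc_claim", "test_present"]
    score = score_candidate(observed)
    print(score, confidence_label(score))
```

Using lower-bound comparisons rather than closed ranges means a score such as 0.795 still receives a label (medium) instead of falling between two bands.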

---

# 10. Classification Rules

## 10.1 Is it an Ability?

Ask:

```text
Would a user search for this as a desired outcome?
```

If yes, probably ability.

Example:

```text
“I need document classification.”
```

Ability.

---

## 10.2 Is it a Capability?

Ask:

```text
Can this behavior be tested with input/output expectations?
```

If yes, probably capability.

Example:

```text
“Classify document into category.”
```

Capability.

---

## 10.3 Is it a Feature?

Ask:

```text
Is this a concrete interface, option, component, or implementation artifact?
```

If yes, probably feature.

Example:

```text
“POST /api/classify-document”
```

Feature.

---

# 11. Anti-Heuristics

Things the extractor should avoid.

## 11.1 Do Not Treat Dependencies as Capabilities

Bad:

```yaml
capability: Uses OpenAI
```

Better:

```yaml
feature: OpenAI provider integration
capability: Generate Text Summary
```

---

## 11.2 Do Not Treat Technology as Ability

Bad:

```yaml
ability: FastAPI
```

Better:

```yaml
feature: FastAPI REST interface
```

---

## 11.3 Do Not Treat Internal Helpers as Capabilities

Bad:

```yaml
capability: Parse YAML Config
```

This is acceptable only when parsing YAML config is a user-visible behavior.

---

## 11.4 Avoid Vendor-Hype Terms

Bad:

```text
intelligent automation
next-gen AI
enterprise-ready transformation
```

Convert into testable candidates:

```text
Classify Documents
Generate Reports
Route Tasks
```

---

# 12. Extraction Pipeline v0.1

## Step 1 — Repository Intake

Collect:

```text
README
docs
examples
tests
package files
source tree
API routes
CLI definitions
```

---

## Step 2 — Structural Summary

Produce:

```yaml
repository_summary:
  languages: []
  frameworks: []
  interfaces: []
  docs_found: []
  tests_found: []
  examples_found: []
```

---

## Step 3 — Candidate Ability Extraction

From README/docs/package descriptions.

Output:

```yaml
candidate_abilities:
  - name
  - description
  - confidence
  - supporting_sources
```

---

## Step 4 — Candidate Capability Extraction

From APIs, tests, examples, public modules.

Output:

```yaml
candidate_capabilities:
  - name
  - description
  - inputs
  - outputs
  - linked_abilities
  - confidence
  - supporting_sources
```

---

## Step 5 — Candidate Feature Extraction

From endpoints, CLI commands, config files, UI components, modules.

Output:

```yaml
candidate_features:
  - name
  - type
  - location
  - linked_capabilities
  - confidence
```

---

## Step 6 — Evidence Linking

Attach evidence:

```yaml
evidence:
  - type
  - path
  - supports
  - strength
```

---

## Step 7 — Review Package

Generate a curator-friendly review view:

```text
Ability
  Capability
    Feature
    Evidence
```
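
A minimal sketch of rendering that view as indented text is shown below. The nested dictionary shape and the `render_review` name are assumptions made for illustration; the pipeline steps above do not fix an in-memory format.

```python
def render_review(abilities):
    """Render abilities as an indented Ability > Capability > Feature/Evidence tree."""
    lines = []
    for ability in abilities:
        lines.append(ability["name"])
        for capability in ability.get("capabilities", []):
            lines.append("  " + capability["name"])
            for feature in capability.get("features", []):
                lines.append("    Feature: " + feature)
            for evidence in capability.get("evidence", []):
                lines.append("    Evidence: " + evidence)
    return "\n".join(lines)


if __name__ == "__main__":
    review = [{
        "name": "Business Email Routing",
        "capabilities": [{
            "name": "Classify Incoming Email",
            "features": ["POST /api/classify-email"],
            "evidence": ["tests/test_email_classification.py"],
        }],
    }]
    print(render_review(review))
```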

---

# 13. Example Extraction

Given README:

```text
MailRouter helps companies automatically classify incoming emails and route them to the right department.
```

Given route:

```text
POST /api/classify-email
```

Given test:

```text
tests/test_email_classification.py
```

Output:

```yaml
abilities:
  - id: ability.business_email_routing
    name: Business Email Routing
    confidence: 0.9

capabilities:
  - id: capability.classify_incoming_email
    name: Classify Incoming Email
    ability_refs:
      - ability.business_email_routing
    confidence: 0.85

features:
  - id: feature.classify_email_endpoint
    name: POST /api/classify-email
    type: REST endpoint
    location: src/routes/classify_email.py
    capability_refs:
      - capability.classify_incoming_email

evidence:
  - type: unit_test
    path: tests/test_email_classification.py
    supports:
      - capability.classify_incoming_email
    strength: strong
```
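
Because every entry in this output points at others by id, a reviewer-facing tool might first check that those references resolve. The sketch below assumes the output has already been loaded into a dictionary with the same keys as the YAML above; `dangling_refs` is a hypothetical helper.

```python
def dangling_refs(doc):
    """Return referenced ids that do not match any declared ability or capability."""
    ability_ids = {a["id"] for a in doc.get("abilities", [])}
    capability_ids = {c["id"] for c in doc.get("capabilities", [])}
    known_ids = ability_ids | capability_ids

    missing = []
    for capability in doc.get("capabilities", []):
        missing += [r for r in capability.get("ability_refs", [])
                    if r not in ability_ids]
    for feature in doc.get("features", []):
        missing += [r for r in feature.get("capability_refs", [])
                    if r not in capability_ids]
    for evidence in doc.get("evidence", []):
        missing += [r for r in evidence.get("supports", [])
                    if r not in known_ids]
    return missing


if __name__ == "__main__":
    extraction = {
        "abilities": [{"id": "ability.business_email_routing"}],
        "capabilities": [{
            "id": "capability.classify_incoming_email",
            "ability_refs": ["ability.business_email_routing"],
        }],
        "features": [{
            "id": "feature.classify_email_endpoint",
            "capability_refs": ["capability.classify_incoming_email"],
        }],
        "evidence": [{
            "type": "unit_test",
            "supports": ["capability.classify_incoming_email"],
        }],
    }
    print(dangling_refs(extraction))
```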

---

# 14. MVP Principle

The extractor should be:

```text
conservative
explainable
reviewable
source-linked
```

Not magical.

The best first version is not the one that extracts everything.

It is the one where the user says:

> “Yes, I understand why the system proposed this.”

