AbilityExtractionHeuristics

How repositories will be explored

Ability / Capability Extraction Heuristics v0.1

Repository Scoping

1. Purpose

The extraction engine should answer:

“What is this repository useful for, what bounded behaviors does it provide, and where are those behaviors implemented?”

It should produce candidate entries, not final truth. Human/agent review remains part of the workflow.


2. Extraction Layers

Ability      → usefulness / problem class
Capability   → bounded behavior
Feature      → concrete interface or implementation
Evidence     → reason to believe the claim
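The four layers can be sketched as nested records. This is a minimal illustration, not a fixed schema: the class and field names are assumptions, loosely borrowed from the output shapes listed later in the pipeline section.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    # Reason to believe a claim, e.g. a test file or an example script.
    type: str
    path: str
    strength: str


@dataclass
class Feature:
    # Concrete interface or implementation artifact.
    name: str
    type: str
    location: str


@dataclass
class Capability:
    # Bounded behavior, exposed through features and backed by evidence.
    name: str
    features: list[Feature] = field(default_factory=list)
    evidence: list[Evidence] = field(default_factory=list)


@dataclass
class Ability:
    # Usefulness / problem class, delivered by one or more capabilities.
    name: str
    capabilities: list[Capability] = field(default_factory=list)
```

Nesting capabilities under abilities is one possible choice; a flat model with id references works just as well and is easier to serialize.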

3. Source Priority

Not all repository signals are equally trustworthy.

Priority 1 — High Trust

Use these first:

README
docs/
examples/
tests/
API specs
CLI help
package metadata

These usually express intended usage.

Priority 2 — Medium Trust

module names
function names
class names
route names
config files
workflow files

These show implemented structure.

Priority 3 — Low Trust

comments
commit messages
dependency names
directory names alone

Useful as supporting signals, but not enough by themselves.
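A rough sketch of how file-level signals could be assigned a trust tier. The directory names, file names, and suffix lists are assumptions chosen for illustration; non-file signals such as commit messages or CLI help output are not covered here.

```python
from pathlib import PurePosixPath

# Hypothetical lookup tables; adjust to the ecosystems you scan.
HIGH_TRUST_DIRS = {"docs", "examples", "tests"}
PACKAGE_METADATA = {"pyproject.toml", "package.json", "setup.cfg", "cargo.toml"}
MEDIUM_TRUST_SUFFIXES = {".py", ".ts", ".go", ".yaml", ".yml", ".toml", ".json"}


def trust_tier(path: str) -> int:
    """Return 1 (high), 2 (medium), or 3 (low) for a repository-relative path."""
    p = PurePosixPath(path)
    name = p.name.lower()
    # README files and package metadata express intended usage.
    if name.startswith("readme") or name in PACKAGE_METADATA:
        return 1
    # Anything under docs/, examples/, or tests/ is treated as high trust.
    if p.parts and p.parts[0].lower() in HIGH_TRUST_DIRS:
        return 1
    # Source and config files show implemented structure.
    if p.suffix.lower() in MEDIUM_TRUST_SUFFIXES:
        return 2
    # Everything else is only a supporting signal.
    return 3
```

For example, `trust_tier("src/routes/classify_email.py")` returns 2, while `trust_tier("tests/test_email_classifier.py")` returns 1.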


4. Ability Extraction Heuristics

Abilities describe why the repository is useful.

4.1 Ability Signal Patterns

Look for phrases like:

"helps users..."
"enables..."
"automates..."
"provides a way to..."
"used for..."
"designed to..."
"allows..."
"supports..."

Example:

"This library helps route incoming business emails."

Candidate ability:

name: Business Email Routing
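The phrase patterns above can be approximated with a simple regular expression. This is a sketch only: the cue list is taken from the patterns above, with a bare "helps" added as an assumption so that the example sentence matches, and the sentence splitter is deliberately naive.

```python
import re

# Cue phrases from the list above; the bare "helps" is an added assumption.
ABILITY_CUES = [
    "helps users", "helps", "enables", "automates", "provides a way to",
    "used for", "designed to", "allows", "supports",
]
CUE_RE = re.compile(
    r"\b(" + "|".join(re.escape(cue) for cue in ABILITY_CUES) + r")\b",
    re.IGNORECASE,
)


def find_ability_sentences(text: str) -> list[str]:
    """Return the sentences that contain at least one ability cue phrase."""
    # Naive split on sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if CUE_RE.search(s)]
```

The matched sentences are only candidates; turning "helps route incoming business emails" into the name "Business Email Routing" still needs a reviewer or a separate naming step.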

4.2 Ability Naming Rule

Ability names should be:

Domain + Problem Class

Good:

Business Email Routing
Document Classification
Invoice Data Extraction
Kubernetes Deployment Inspection
Agent Workflow Orchestration

Bad:

Fast API
Email Button
Classifier
Uses GPT

4.3 Ability Extraction Sources

Best sources for abilities:

README intro
project tagline
docs overview
examples index
package description

Abilities are usually described in prose, not code.


4.4 Ability Confidence

Assign confidence based on signal quality:

confidence:
  high:
    - explicitly stated in README/docs
    - supported by examples
    - supported by tests or APIs

  medium:
    - inferred from multiple capabilities/features
    - visible in examples but not stated

  low:
    - inferred from names only
    - based on dependencies or folder structure

5. Capability Extraction Heuristics

Capabilities describe bounded behavior.

5.1 Capability Signal Patterns

Look for verbs applied to objects:

classify email
extract invoice data
summarize document
validate schema
generate response
deploy service
monitor cluster
route ticket
convert format

Pattern:

Verb + Object

Examples:

Classify Email Intent
Extract Invoice Metadata
Generate Routing Explanation
Validate Repository Metadata
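One way to apply the Verb + Object pattern to code identifiers is sketched below. The verb list comes from the examples above; the function names and the splitting rules are assumptions for illustration.

```python
import re

# Verbs taken from the examples above; a real extractor would need a larger list.
KNOWN_VERBS = {
    "classify", "extract", "summarize", "validate", "generate",
    "deploy", "monitor", "route", "convert",
}


def split_identifier(name: str) -> list[str]:
    """Split snake_case, kebab-case, or camelCase into lowercase tokens."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name)
    return [t.lower() for t in re.split(r"[\s_\-/]+", spaced) if t]


def capability_candidate(identifier: str):
    """Return a 'Verb Object' candidate name, or None if the pattern does not fit."""
    tokens = split_identifier(identifier)
    # Require a known verb followed by at least one object token.
    if len(tokens) >= 2 and tokens[0] in KNOWN_VERBS:
        return " ".join(t.capitalize() for t in tokens)
    return None
```

Identifiers such as `parser` or `email_button` yield None, which matches the intent of rejecting names that lack an action verb.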

5.2 Capability Naming Rule

Capability names should be:

Action Verb + Domain Object

Good:

Classify Incoming Email
Extract PDF Metadata
Generate API Client
Validate Kubernetes Manifest
Detect Broken Links

Bad:

Email Capability
Parser
Smart Document Stuff
Endpoint

5.3 Capability Sources

Best sources:

API route names
CLI commands
public functions
service classes
tests
examples
docs tutorials

Capabilities are often visible in code and tests.


5.4 Capability Boundary Rule

A capability should be small enough to test.

Good:

Extract invoice date from PDF
Classify email into intent category
Generate markdown from DOCX

Too broad:

Manage documents
Automate business
Understand everything

Too narrow:

Read config variable
Call helper function
Trim whitespace

Rule of thumb:

If you can write a meaningful acceptance test for it, it is probably a capability.


6. Feature Extraction Heuristics

Features describe how the capability is exposed or implemented.

6.1 Feature Signal Patterns

Look for concrete affordances:

REST endpoint
CLI command
UI component
configuration option
SDK method
background job
database migration
import/export format
plugin hook

Examples:

features:
  - name: /classify-email endpoint
  - name: classify-email CLI command
  - name: department-rules.yaml config
  - name: JSON result export

6.2 Feature Naming Rule

Feature names should be concrete and inspectable.

Good:

POST /api/classify-email
classify-email CLI command
Rule Configuration File
PDF Upload Component

Bad:

AI routing
Document understanding
Magic extraction

7. Evidence Extraction Heuristics

Evidence supports claims.

7.1 Evidence Types

evidence_types:
  - unit_test
  - integration_test
  - example
  - demo
  - benchmark
  - documentation
  - api_specification
  - production_usage_note
  - manual_review

7.2 Evidence Mapping

Map evidence to the nearest capability.

Example:

tests/test_email_classifier.py

Supports:

Classify Incoming Email

Example:

examples/invoice_extraction_demo.py

Supports:

Extract Invoice Metadata
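The "nearest capability" mapping could be approximated by word overlap between the evidence path and the capability name. The sketch below is an assumption, not a prescribed method: it compares crude six-character prefixes so that "classifier" matches "classify" and "extraction" matches "extract".

```python
import re
from pathlib import PurePosixPath

# Path words that carry no information about the behavior under test (assumed list).
NOISE = {"test", "tests", "demo", "example", "examples"}


def match_keys(text: str) -> set[str]:
    """Lowercase words reduced to a 6-character prefix, as a crude stem."""
    words = re.split(r"[^a-z0-9]+", text.lower())
    return {w[:6] for w in words if w and w not in NOISE}


def nearest_capability(evidence_path: str, capability_names: list[str]):
    """Return the capability name sharing the most keys with the path, or None."""
    stem = PurePosixPath(evidence_path).with_suffix("").as_posix()
    path_keys = match_keys(stem)
    best_name, best_score = None, 0
    for name in capability_names:
        score = len(path_keys & match_keys(name))
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

A None result means no overlap was found; such evidence should go to review unlinked rather than be attached to an arbitrary capability.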

7.3 Evidence Strength

evidence_strength:
  strong:
    - automated tests
    - benchmark results
    - executable examples
    - integration tests

  medium:
    - documentation
    - tutorials
    - screenshots
    - sample output

  weak:
    - README claim only
    - comments
    - filename hints

8. Ability-Capability-Feature Linking

Ability explains why.
Capability explains what.
Feature explains how/where.

Example:

ability:
  name: Business Email Routing

capability:
  name: Classify Incoming Email
  supports:
    - Business Email Routing

feature:
  name: POST /api/classify-email
  implements:
    - Classify Incoming Email

8.2 Linking Heuristic

A capability supports an ability if:

Removing the capability would weaken the repository's ability to deliver that usefulness.

A feature implements a capability if:

The feature is an interface, component, or code location through which the behavior is performed or exposed.

9. Confidence Scoring

Use a simple additive model first.

9.1 Candidate Confidence Factors

confidence_factors:
  explicit_doc_claim: +0.30
  example_present: +0.20
  test_present: +0.25
  implementation_location_found: +0.15
  api_or_cli_exposed: +0.15
  multiple_source_agreement: +0.20
  inferred_from_names_only: -0.25
  no_evidence: -0.30

Normalize to:

0.0 - 1.0

9.2 Confidence Labels

0.80 - 1.00: high
0.50 - 0.79: medium
0.20 - 0.49: low
0.00 - 0.19: speculative
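A minimal sketch of the additive model with the weights and label bands above. The text does not say how to normalize, so clamping the raw sum into the 0.0 - 1.0 range is an assumption here; the function names are also hypothetical.

```python
# Weights copied from the confidence_factors table above.
CONFIDENCE_FACTORS = {
    "explicit_doc_claim": 0.30,
    "example_present": 0.20,
    "test_present": 0.25,
    "implementation_location_found": 0.15,
    "api_or_cli_exposed": 0.15,
    "multiple_source_agreement": 0.20,
    "inferred_from_names_only": -0.25,
    "no_evidence": -0.30,
}


def confidence_score(signals) -> float:
    """Sum the weights of the observed signals and clamp to [0.0, 1.0]."""
    # Each signal counts once, even if it was reported several times.
    raw = sum(CONFIDENCE_FACTORS[s] for s in set(signals))
    # Assumption: "normalize" is implemented as clamping.
    return round(min(1.0, max(0.0, raw)), 2)


def confidence_label(score: float) -> str:
    """Map a score to the label bands above."""
    if score >= 0.80:
        return "high"
    if score >= 0.50:
        return "medium"
    if score >= 0.20:
        return "low"
    return "speculative"
```

A candidate with a documentation claim, a test, an exposed endpoint, and a known implementation location sums to 0.85 and is labeled high. With every positive factor present the raw sum is 1.25, which is why some form of normalization is needed.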

10. Classification Rules

10.1 Is it an Ability?

Ask:

Would a user search for this as a desired outcome?

If yes, probably ability.

Example:

“I need document classification.”

Ability.


10.2 Is it a Capability?

Ask:

Can this behavior be tested with input/output expectations?

If yes, probably capability.

Example:

“Classify document into category.”

Capability.


10.3 Is it a Feature?

Ask:

Is this a concrete interface, option, component, or implementation artifact?

If yes, probably feature.

Example:

“POST /api/classify-document”

Feature.


11. Anti-Heuristics

Things the extractor should avoid.

11.1 Do Not Treat Dependencies as Capabilities

Bad:

capability: Uses OpenAI

Better:

feature: OpenAI provider integration
capability: Generate Text Summary

11.2 Do Not Treat Technology as Ability

Bad:

ability: FastAPI

Better:

feature: FastAPI REST interface

11.3 Do Not Treat Internal Helpers as Capabilities

Bad:

capability: Parse YAML Config

Unless parsing YAML config is a user-visible behavior.


11.4 Avoid Vendor-Hype Terms

Bad:

intelligent automation
next-gen AI
enterprise-ready transformation

Convert into testable candidates:

Classify Documents
Generate Reports
Route Tasks
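Some of these anti-heuristics can be checked mechanically before candidates reach review. The sketch below is illustrative only: the technology list is a hypothetical stand-in, and the problem codes are made-up names.

```python
# Hype terms from 11.4; the technology names are a hypothetical sample.
HYPE_TERMS = ("intelligent automation", "next-gen ai", "enterprise-ready transformation")
TECHNOLOGY_NAMES = {"fastapi", "openai", "django"}


def lint_candidate(kind: str, name: str) -> list[str]:
    """Return a list of problem codes for a candidate; an empty list means no flags."""
    problems = []
    lowered = name.lower().strip()
    # 11.4: vendor-hype wording is not a testable candidate.
    if any(term in lowered for term in HYPE_TERMS):
        problems.append("hype_term")
    # 11.1: "Uses X" describes a dependency, not a behavior.
    if kind == "capability" and lowered.startswith("uses "):
        problems.append("dependency_as_capability")
    # 11.2: a bare technology name is not a problem class.
    if kind == "ability" and lowered.replace(" ", "") in TECHNOLOGY_NAMES:
        problems.append("technology_as_ability")
    return problems
```

Rule 11.3 (internal helpers) is left out on purpose: whether a behavior is user-visible is a judgment call that a string check cannot make.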

12. Extraction Pipeline v0.1

Step 1 — Repository Intake

Collect:

README
docs
examples
tests
package files
source tree
API routes
CLI definitions

Step 2 — Structural Summary

Produce:

repository_summary:
  languages: []
  frameworks: []
  interfaces: []
  docs_found: []
  tests_found: []
  examples_found: []
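A sketch of how the path-derived parts of this summary might be collected. The suffix table and the top-level directory names are assumptions; `frameworks` and `interfaces` are left empty because they need manifest parsing and route or CLI discovery, which are not shown.

```python
from pathlib import Path

# Hypothetical suffix table; extend for the languages you care about.
LANGUAGE_BY_SUFFIX = {".py": "Python", ".ts": "TypeScript", ".go": "Go", ".rs": "Rust"}


def summarize_repository(root) -> dict:
    """Fill the repository_summary fields that can be read from paths alone."""
    root = Path(root)
    languages = set()
    summary = {
        "languages": [],
        "frameworks": [],   # not derived here
        "interfaces": [],   # not derived here
        "docs_found": [],
        "tests_found": [],
        "examples_found": [],
    }
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        # Skip directories and version-control internals.
        if not path.is_file() or ".git" in rel.parts:
            continue
        top = rel.parts[0].lower()
        suffix = path.suffix.lower()
        if suffix in LANGUAGE_BY_SUFFIX:
            languages.add(LANGUAGE_BY_SUFFIX[suffix])
        if top == "docs" or path.name.lower().startswith("readme"):
            summary["docs_found"].append(rel.as_posix())
        elif top == "tests":
            summary["tests_found"].append(rel.as_posix())
        elif top == "examples":
            summary["examples_found"].append(rel.as_posix())
    summary["languages"] = sorted(languages)
    return summary
```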

Step 3 — Candidate Ability Extraction

From README/docs/package descriptions.

Output:

candidate_abilities:
  - name
  - description
  - confidence
  - supporting_sources

Step 4 — Candidate Capability Extraction

From APIs, tests, examples, public modules.

Output:

candidate_capabilities:
  - name
  - description
  - inputs
  - outputs
  - linked_abilities
  - confidence
  - supporting_sources

Step 5 — Candidate Feature Extraction

From endpoints, CLI commands, config files, UI components, modules.

Output:

candidate_features:
  - name
  - type
  - location
  - linked_capabilities
  - confidence

Step 6 — Evidence Linking

Attach evidence:

evidence:
  - type
  - path
  - supports
  - strength

Step 7 — Review Package

Generate a curator-friendly review view:

Ability
  Capability
    Feature
    Evidence
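The review view can be produced with a plain nested loop over the candidate lists. This sketch assumes the candidates are dictionaries that reference each other by id (`ability_refs`, `capability_refs`, `supports`); the function name and the exact line format are hypothetical.

```python
def render_review_tree(data: dict) -> str:
    """Render abilities, their capabilities, and attached features/evidence as indented text."""
    lines = []
    for ability in data.get("abilities", []):
        lines.append(f"{ability['name']} ({ability['confidence']})")
        for cap in data.get("capabilities", []):
            # Only show capabilities linked to this ability.
            if ability["id"] not in cap.get("ability_refs", []):
                continue
            lines.append(f"  {cap['name']} ({cap['confidence']})")
            for feat in data.get("features", []):
                if cap["id"] in feat.get("capability_refs", []):
                    lines.append(f"    Feature: {feat['name']} @ {feat['location']}")
            for ev in data.get("evidence", []):
                if cap["id"] in ev.get("supports", []):
                    lines.append(f"    Evidence: {ev['path']} [{ev['strength']}]")
    return "\n".join(lines)
```

Showing the source path next to every feature and evidence entry keeps the view source-linked, so a curator can open the file and check the claim.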

13. Example Extraction

Given README:

MailRouter helps companies automatically classify incoming emails and route them to the right department.

Given route:

POST /api/classify-email

Given test:

tests/test_email_classification.py

Output:

abilities:
  - id: ability.business_email_routing
    name: Business Email Routing
    confidence: 0.9

capabilities:
  - id: capability.classify_incoming_email
    name: Classify Incoming Email
    ability_refs:
      - ability.business_email_routing
    confidence: 0.85

features:
  - id: feature.classify_email_endpoint
    name: POST /api/classify-email
    type: REST endpoint
    location: src/routes/classify_email.py
    capability_refs:
      - capability.classify_incoming_email

evidence:
  - type: unit_test
    path: tests/test_email_classification.py
    supports:
      - capability.classify_incoming_email
    strength: strong
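Because the output links entries by id, a cheap consistency check before review is to look for references that point at nothing. The sketch below loads the example output as a Python dictionary; the function name is hypothetical.

```python
# The example output above, expressed as a Python dictionary.
EXAMPLE = {
    "abilities": [
        {"id": "ability.business_email_routing", "name": "Business Email Routing", "confidence": 0.9},
    ],
    "capabilities": [
        {"id": "capability.classify_incoming_email", "name": "Classify Incoming Email",
         "ability_refs": ["ability.business_email_routing"], "confidence": 0.85},
    ],
    "features": [
        {"id": "feature.classify_email_endpoint", "name": "POST /api/classify-email",
         "type": "REST endpoint", "location": "src/routes/classify_email.py",
         "capability_refs": ["capability.classify_incoming_email"]},
    ],
    "evidence": [
        {"type": "unit_test", "path": "tests/test_email_classification.py",
         "supports": ["capability.classify_incoming_email"], "strength": "strong"},
    ],
}


def dangling_refs(data: dict) -> list[tuple[str, str]]:
    """Return (source, missing_id) pairs for references that do not resolve."""
    ability_ids = {a["id"] for a in data.get("abilities", [])}
    capability_ids = {c["id"] for c in data.get("capabilities", [])}
    problems = []
    for cap in data.get("capabilities", []):
        problems += [(cap["id"], r) for r in cap.get("ability_refs", []) if r not in ability_ids]
    for feat in data.get("features", []):
        problems += [(feat["id"], r) for r in feat.get("capability_refs", []) if r not in capability_ids]
    for ev in data.get("evidence", []):
        # Evidence entries have no id in the example, so the path identifies them.
        problems += [(ev["path"], r) for r in ev.get("supports", []) if r not in capability_ids]
    return problems
```

For the example above the result is an empty list; a non-empty list points the reviewer at the exact entry and the id it failed to resolve.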

14. MVP Principle

The extractor should be:

conservative
explainable
reviewable
source-linked

Not magical.

The best first version is not the one that extracts everything.

It is the one where the user says:

“Yes, I understand why the system proposed this.”
