added initial concept documents

2026-04-25 21:15:17 +02:00
parent 4bcd22a518
commit 8d3d5aab42
7 changed files with 3052 additions and 0 deletions

@@ -0,0 +1,822 @@
AbilityExtractionHeuristics
*How repositories will be explored*
# Ability / Capability Extraction Heuristics v0.1
## Repository Ability Registry
# 1. Purpose
The extraction engine should answer:
> “What is this repository useful for, what bounded behaviors does it provide, and where are those behaviors implemented?”
It should produce **candidate entries**, not final truth. Human/agent review remains part of the workflow.
---
# 2. Extraction Layers
```text
Ability → usefulness / problem class
Capability → bounded behavior
Feature → concrete interface or implementation
Evidence → reason to believe the claim
```
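The four layers can be sketched as a small data model. This is a minimal sketch, not a fixed schema: the class layout is an assumption, and the field names (`id`, `name`, `ability_refs`, `capability_refs`, `type`, `path`, `supports`) are borrowed from the example extraction in section 13.

```python
from dataclasses import dataclass, field


@dataclass
class Ability:
    id: str
    name: str  # usefulness / problem class


@dataclass
class Capability:
    id: str
    name: str  # bounded behavior
    ability_refs: list[str] = field(default_factory=list)


@dataclass
class Feature:
    id: str
    name: str  # concrete interface or implementation
    capability_refs: list[str] = field(default_factory=list)


@dataclass
class Evidence:
    type: str  # e.g. "unit_test" or "documentation"
    path: str  # where the evidence lives in the repository
    supports: list[str] = field(default_factory=list)  # ids of the claims it backs
```

Keeping references as plain id strings keeps candidates easy to serialize for review.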
---
# 3. Source Priority
Not all repository signals are equally trustworthy.
## Priority 1 — High Trust
Use these first:
```text
README
docs/
examples/
tests/
API specs
CLI help
package metadata
```
These usually express intended usage.
## Priority 2 — Medium Trust
```text
module names
function names
class names
route names
config files
workflow files
```
These show implemented structure.
## Priority 3 — Low Trust
```text
comments
commit messages
dependency names
directory names alone
```
Useful as supporting signals, but not enough by themselves.
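
One way to apply the priority tiers is a lookup table that orders collected signals before extraction. The sketch below assumes signals arrive as `(kind, text)` pairs; the kind labels and the `order_by_trust` helper are hypothetical names chosen for illustration.

```python
# Trust tier per signal kind, following the three priority levels above.
SOURCE_TRUST = {
    # Priority 1: high trust
    "readme": 1, "docs": 1, "examples": 1, "tests": 1,
    "api_spec": 1, "cli_help": 1, "package_metadata": 1,
    # Priority 2: medium trust
    "module_name": 2, "function_name": 2, "class_name": 2,
    "route_name": 2, "config_file": 2, "workflow_file": 2,
    # Priority 3: low trust
    "comment": 3, "commit_message": 3,
    "dependency_name": 3, "directory_name": 3,
}


def order_by_trust(signals):
    """Sort (kind, text) pairs so higher-trust sources come first.

    Unknown kinds are treated as low trust (an assumption of this sketch).
    """
    return sorted(signals, key=lambda s: SOURCE_TRUST.get(s[0], 3))
```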
---
# 4. Ability Extraction Heuristics
Abilities describe **why the repository is useful**.
## 4.1 Ability Signal Patterns
Look for phrases like:
```text
"helps users..."
"enables..."
"automates..."
"provides a way to..."
"used for..."
"designed to..."
"allows..."
"supports..."
```
Example:
```text
"This library helps route incoming business emails."
```
Candidate ability:
```yaml
name: Business Email Routing
```
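
A rough sketch of how these phrases could be detected in README prose. It assumes simple regex matching on sentences; `find_ability_signals` is a hypothetical helper, and "helps users" is relaxed to "helps" so that the example sentence above matches.

```python
import re

# Signal phrases from the list above; "helps users" is relaxed to "helps".
SIGNAL_RE = re.compile(
    r"\b(?:helps|enables|automates|provides a way to|used for|"
    r"designed to|allows|supports)\b",
    re.IGNORECASE,
)


def find_ability_signals(text: str) -> list[str]:
    """Return the sentences that contain at least one ability signal phrase."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if SIGNAL_RE.search(s)]
```

The matched sentences are only candidates; naming the ability still follows the rule in 4.2.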
---
## 4.2 Ability Naming Rule
Ability names should be:
```text
Domain + Problem Class
```
Good:
```text
Business Email Routing
Document Classification
Invoice Data Extraction
Kubernetes Deployment Inspection
Agent Workflow Orchestration
```
Bad:
```text
Fast API
Email Button
Classifier
Uses GPT
```
---
## 4.3 Ability Extraction Sources
Best sources for abilities:
```text
README intro
project tagline
docs overview
examples index
package description
```
Abilities are usually described in prose, not in code.
---
## 4.4 Ability Confidence
Assign confidence based on signal quality:
```yaml
confidence:
  high:
    - explicitly stated in README/docs
    - supported by examples
    - supported by tests or APIs
  medium:
    - inferred from multiple capabilities/features
    - visible in examples but not stated
  low:
    - inferred from names only
    - based on dependencies or folder structure
```
---
# 5. Capability Extraction Heuristics
Capabilities describe **bounded behavior**.
## 5.1 Capability Signal Patterns
Look for verbs applied to objects:
```text
classify email
extract invoice data
summarize document
validate schema
generate response
deploy service
monitor cluster
route ticket
convert format
```
Pattern:
```text
Verb + Object
```
Examples:
```text
Classify Email Intent
Extract Invoice Metadata
Generate Routing Explanation
Validate Repository Metadata
```
---
## 5.2 Capability Naming Rule
Capability names should be:
```text
Action Verb + Domain Object
```
Good:
```text
Classify Incoming Email
Extract PDF Metadata
Generate API Client
Validate Kubernetes Manifest
Detect Broken Links
```
Bad:
```text
Email Capability
Parser
Smart Document Stuff
Endpoint
```
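
The naming rule can be checked mechanically in a first pass. The sketch below assumes a small hand-maintained verb list taken from the examples in this section; `looks_like_capability_name` is a hypothetical helper, and a real extractor would need a larger vocabulary.

```python
# Action verbs that appear in this section's examples (not exhaustive).
ACTION_VERBS = {
    "classify", "extract", "summarize", "validate", "generate",
    "deploy", "monitor", "route", "convert", "detect",
}


def looks_like_capability_name(name: str) -> bool:
    """True if the name starts with a known action verb followed by an object."""
    words = name.split()
    return len(words) >= 2 and words[0].lower() in ACTION_VERBS
```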
---
## 5.3 Capability Sources
Best sources:
```text
API route names
CLI commands
public functions
service classes
tests
examples
docs tutorials
```
Capabilities are often visible in code and tests.
---
## 5.4 Capability Boundary Rule
A capability should be small enough to test.
Good:
```text
Extract invoice date from PDF
Classify email into intent category
Generate markdown from DOCX
```
Too broad:
```text
Manage documents
Automate business
Understand everything
```
Too narrow:
```text
Read config variable
Call helper function
Trim whitespace
```
Rule of thumb:
> If you can write a meaningful acceptance test for it, it is probably a capability.
---
# 6. Feature Extraction Heuristics
Features describe **how the capability is exposed or implemented**.
## 6.1 Feature Signal Patterns
Look for concrete affordances:
```text
REST endpoint
CLI command
UI component
configuration option
SDK method
background job
database migration
import/export format
plugin hook
```
Examples:
```yaml
features:
- name: /classify-email endpoint
- name: classify-email CLI command
- name: department-rules.yaml config
- name: JSON result export
```
---
## 6.2 Feature Naming Rule
Feature names should be concrete and inspectable.
Good:
```text
POST /api/classify-email
classify-email CLI command
Rule Configuration File
PDF Upload Component
```
Bad:
```text
AI routing
Document understanding
Magic extraction
```
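
A concrete feature name often reveals its type. As a minimal sketch, assuming only two recognizable shapes (HTTP method plus path, and names ending in "CLI command"), a hypothetical `guess_feature_type` helper could look like this:

```python
import re
from typing import Optional

# HTTP method followed by a path, e.g. "POST /api/classify-email".
ENDPOINT_RE = re.compile(r"^(GET|POST|PUT|PATCH|DELETE)\s+/\S*$")


def guess_feature_type(name: str) -> Optional[str]:
    """Guess a feature type from its name, or return None if unclear."""
    if ENDPOINT_RE.match(name):
        return "REST endpoint"
    if name.lower().endswith("cli command"):
        return "CLI command"
    return None  # leave the decision to review
```

Names such as "Rule Configuration File" return `None` here and stay with the reviewer.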
---
# 7. Evidence Extraction Heuristics
Evidence supports claims.
## 7.1 Evidence Types
```yaml
evidence_types:
  - unit_test
  - integration_test
  - example
  - demo
  - benchmark
  - documentation
  - API specification
  - production usage note
  - manual review
```
---
## 7.2 Evidence Mapping
Map evidence to the nearest capability.
Example:
```text
tests/test_email_classifier.py
```
Supports:
```text
Classify Incoming Email
```
Example:
```text
examples/invoice_extraction_demo.py
```
Supports:
```text
Extract Invoice Metadata
```
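
"Nearest capability" can be approximated by token overlap between the evidence path and the capability name. The sketch below assumes a crude prefix match (so `classifier` relates to `classify`); `nearest_capability` and its helpers are hypothetical, and the 5-character prefix threshold is an arbitrary choice.

```python
import os
import re


def _tokens(text: str) -> list[str]:
    """Lowercase alphanumeric tokens of a path or name."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def _related(a: str, b: str, min_prefix: int = 5) -> bool:
    """Crude stemming: tokens are related if they share a long enough prefix."""
    shared = len(os.path.commonprefix([a, b]))
    return shared >= min(min_prefix, len(a), len(b))


def nearest_capability(path: str, capabilities: list[str]):
    """Return the capability name with the most tokens related to the path."""
    path_tokens = _tokens(path)
    best, best_score = None, 0
    for cap in capabilities:
        score = sum(
            any(_related(c, p) for p in path_tokens) for c in _tokens(cap)
        )
        if score > best_score:
            best, best_score = cap, score
    return best  # None when nothing overlaps
```

A match found this way is a candidate link only and should carry its path as the supporting source.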
---
## 7.3 Evidence Strength
```yaml
evidence_strength:
  strong:
    - automated tests
    - benchmark results
    - executable examples
    - integration tests
  medium:
    - documentation
    - tutorials
    - screenshots
    - sample output
  weak:
    - README claim only
    - comments
    - filename hints
```
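
The strength table can be applied as a lookup from evidence type to strength. This sketch maps only the types from 7.1 that 7.3 clearly covers; treating `example` as executable and defaulting unmapped types to `weak` are assumptions, and `strongest` is a hypothetical helper.

```python
# Evidence type -> strength, for the types that 7.3 clearly covers.
STRENGTH_BY_TYPE = {
    "unit_test": "strong",
    "integration_test": "strong",
    "benchmark": "strong",
    "example": "strong",  # assumes the example is executable
    "documentation": "medium",
}

RANK = {"weak": 0, "medium": 1, "strong": 2}


def strongest(evidence_types):
    """Return the best strength among the given evidence types, or None."""
    strengths = [STRENGTH_BY_TYPE.get(t, "weak") for t in evidence_types]
    return max(strengths, key=RANK.__getitem__, default=None)
```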
---
# 8. Ability / Capability / Feature Linking
## 8.1 Link Rule
```text
Ability explains why.
Capability explains what.
Feature explains how/where.
```
Example:
```yaml
ability:
  name: Business Email Routing
capability:
  name: Classify Incoming Email
  supports:
    - Business Email Routing
feature:
  name: POST /api/classify-email
  implements:
    - Classify Incoming Email
```
---
## 8.2 Linking Heuristic
A capability supports an ability if:
```text
Removing the capability would weaken the repository's ability to deliver that usefulness.
```
A feature implements a capability if:
```text
The feature is an interface, component, or code location through which the behavior is performed or exposed.
```
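
Links are only useful if they resolve. A minimal sketch of a consistency check, assuming the registry is a plain dict shaped like the example output in section 13 (`abilities`, `capabilities`, `features` with `ability_refs` and `capability_refs`); `dangling_refs` is a hypothetical helper.

```python
def dangling_refs(registry: dict) -> list[str]:
    """List links that point at ids missing from the registry."""
    ability_ids = {a["id"] for a in registry.get("abilities", [])}
    capability_ids = {c["id"] for c in registry.get("capabilities", [])}
    problems = []
    for cap in registry.get("capabilities", []):
        for ref in cap.get("ability_refs", []):
            if ref not in ability_ids:
                problems.append(f"{cap['id']} -> {ref}")
    for feat in registry.get("features", []):
        for ref in feat.get("capability_refs", []):
            if ref not in capability_ids:
                problems.append(f"{feat['id']} -> {ref}")
    return problems
```

An empty result means every capability and feature links to a known parent; it says nothing about whether the link is semantically right.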
---
# 9. Confidence Scoring
Use a simple additive model first.
## 9.1 Candidate Confidence Factors
```yaml
confidence_factors:
  explicit_doc_claim: +0.30
  example_present: +0.20
  test_present: +0.25
  implementation_location_found: +0.15
  api_or_cli_exposed: +0.15
  multiple_source_agreement: +0.20
  inferred_from_names_only: -0.25
  no_evidence: -0.30
```
Normalize to:
```text
0.0 - 1.0
```
## 9.2 Confidence Labels
```yaml
0.80 - 1.00: high
0.50 - 0.79: medium
0.20 - 0.49: low
0.00 - 0.19: speculative
```
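
The additive model and the labels can be sketched in a few lines. One assumption is made loudly here: "normalize" is read as clamping the raw sum into the 0.0 - 1.0 range, since the positive factors alone add up to 1.25. The function names are hypothetical.

```python
CONFIDENCE_FACTORS = {
    "explicit_doc_claim": 0.30,
    "example_present": 0.20,
    "test_present": 0.25,
    "implementation_location_found": 0.15,
    "api_or_cli_exposed": 0.15,
    "multiple_source_agreement": 0.20,
    "inferred_from_names_only": -0.25,
    "no_evidence": -0.30,
}


def confidence_score(factors: list[str]) -> float:
    """Sum the factor weights and clamp into [0.0, 1.0] (clamping is assumed)."""
    total = sum(CONFIDENCE_FACTORS[name] for name in factors)
    return round(min(1.0, max(0.0, total)), 2)


def confidence_label(score: float) -> str:
    """Map a score to the labels from 9.2."""
    if score >= 0.80:
        return "high"
    if score >= 0.50:
        return "medium"
    if score >= 0.20:
        return "low"
    return "speculative"
```

For instance, a documented claim with a test and an exposed endpoint sums to 0.70 and is labeled `medium`; adding an example pushes it to 0.90 and `high`.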
---
# 10. Classification Rules
## 10.1 Is it an Ability?
Ask:
```text
Would a user search for this as a desired outcome?
```
If yes, probably ability.
Example:
```text
“I need document classification.”
```
Ability.
---
## 10.2 Is it a Capability?
Ask:
```text
Can this behavior be tested with input/output expectations?
```
If yes, probably capability.
Example:
```text
“Classify document into category.”
```
Capability.
---
## 10.3 Is it a Feature?
Ask:
```text
Is this a concrete interface, option, component, or implementation artifact?
```
If yes, probably feature.
Example:
```text
“POST /api/classify-document”
```
Feature.
---
# 11. Anti-Heuristics
Things the extractor should avoid.
## 11.1 Do Not Treat Dependencies as Capabilities
Bad:
```yaml
capability: Uses OpenAI
```
Better:
```yaml
feature: OpenAI provider integration
capability: Generate Text Summary
```
---
## 11.2 Do Not Treat Technology as Ability
Bad:
```yaml
ability: FastAPI
```
Better:
```yaml
feature: FastAPI REST interface
```
---
## 11.3 Do Not Treat Internal Helpers as Capabilities
Bad:
```yaml
capability: Parse YAML Config
```
Unless parsing YAML config is a user-visible behavior.
---
## 11.4 Avoid Vendor-Hype Terms
Bad:
```text
intelligent automation
next-gen AI
enterprise-ready transformation
```
Convert into testable candidates:
```text
Classify Documents
Generate Reports
Route Tasks
```
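
Some of these anti-heuristics can be caught by a simple lint pass over candidates. The sketch below covers 11.1, 11.2 and 11.4 only; the technology list is a two-entry sample taken from the examples above, and `lint_candidate` is a hypothetical helper. Internal helpers (11.3) need human judgment and are not covered.

```python
HYPE_TERMS = (
    "intelligent automation",
    "next-gen ai",
    "enterprise-ready transformation",
)
TECHNOLOGY_NAMES = {"fastapi", "openai"}  # sample list, assumed for illustration


def lint_candidate(kind: str, name: str) -> list[str]:
    """Return warnings for a candidate; an empty list means no rule fired."""
    lowered = name.lower()
    warnings = []
    if kind == "capability" and lowered.startswith("uses "):
        warnings.append("dependency stated as capability")
    if kind == "ability" and lowered in TECHNOLOGY_NAMES:
        warnings.append("technology stated as ability")
    if any(term in lowered for term in HYPE_TERMS):
        warnings.append("vendor-hype term")
    return warnings
```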
---
# 12. Extraction Pipeline v0.1
## Step 1 — Repository Intake
Collect:
```text
README
docs
examples
tests
package files
source tree
API routes
CLI definitions
```
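
Intake can start as plain path bucketing. The sketch below works on a list of repository-relative paths and assumes conventional top-level directory names (`docs`, `examples`, `tests`) and a few well-known package file names; `bucket_paths` is a hypothetical helper, and API routes and CLI definitions are left to later, language-specific steps.

```python
from pathlib import PurePosixPath

# Common package manifest names; extend per ecosystem.
PACKAGE_FILES = {"pyproject.toml", "setup.py", "package.json", "Cargo.toml", "go.mod"}


def bucket_paths(paths: list[str]) -> dict[str, list[str]]:
    """Group repository-relative paths into intake buckets."""
    buckets = {
        "readme": [], "docs": [], "examples": [],
        "tests": [], "package_files": [], "source_tree": [],
    }
    for raw in paths:
        p = PurePosixPath(raw)
        top = p.parts[0] if len(p.parts) > 1 else ""
        if p.name.lower().startswith("readme"):
            buckets["readme"].append(raw)
        elif p.name in PACKAGE_FILES:
            buckets["package_files"].append(raw)
        elif top in ("docs", "examples", "tests"):
            buckets[top].append(raw)
        else:
            buckets["source_tree"].append(raw)
    return buckets
```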
---
## Step 2 — Structural Summary
Produce:
```yaml
repository_summary:
  languages: []
  frameworks: []
  interfaces: []
  docs_found: []
  tests_found: []
  examples_found: []
```
---
## Step 3 — Candidate Ability Extraction
From README/docs/package descriptions.
Output:
```yaml
candidate_abilities:
- name
- description
- confidence
- supporting_sources
```
---
## Step 4 — Candidate Capability Extraction
From APIs, tests, examples, public modules.
Output:
```yaml
candidate_capabilities:
- name
- description
- inputs
- outputs
- linked_abilities
- confidence
- supporting_sources
```
---
## Step 5 — Candidate Feature Extraction
From endpoints, CLI commands, config files, UI components, modules.
Output:
```yaml
candidate_features:
- name
- type
- location
- linked_capabilities
- confidence
```
---
## Step 6 — Evidence Linking
Attach evidence:
```yaml
evidence:
- type
- path
- supports
- strength
```
---
## Step 7 — Review Package
Generate a curator-friendly review view:
```text
Ability
Capability
Feature
Evidence
```
---
# 13. Example Extraction
Given README:
```text
MailRouter helps companies automatically classify incoming emails and route them to the right department.
```
Given route:
```text
POST /api/classify-email
```
Given test:
```text
tests/test_email_classification.py
```
Output:
```yaml
abilities:
  - id: ability.business_email_routing
    name: Business Email Routing
    confidence: 0.9
capabilities:
  - id: capability.classify_incoming_email
    name: Classify Incoming Email
    ability_refs:
      - ability.business_email_routing
    confidence: 0.85
features:
  - id: feature.classify_email_endpoint
    name: POST /api/classify-email
    type: REST endpoint
    location: src/routes/classify_email.py
    capability_refs:
      - capability.classify_incoming_email
evidence:
  - type: unit_test
    path: tests/test_email_classification.py
    supports:
      - capability.classify_incoming_email
    strength: strong
```
---
# 14. MVP Principle
The extractor should be:
```text
conservative
explainable
reviewable
source-linked
```
Not magical.
The best first version is not the one that extracts everything.
It is the one where the user says:
> “Yes, I understand why the system proposed this.”