Universal Evolutionary Feature Discovery and Selection Framework
Project description
UEFDS: Universal Evolutionary Feature Discovery and Selection Framework
UEFDS is a feature-evolution pipeline for automated feature engineering:
- Stage 1 creates many new feature formulas from your raw features.
- Compression keeps only distinct survivors (removes near-duplicates).
- Stage 2 chooses the final survivor subset for model training and usage.
The goal is to produce a large, useful feature library for users to reuse, then a compact final subset for deployment, with lineage tracking so every surviving feature can be traced back to its parents.
1) What Problem This Solves
Many datasets need feature engineering before models perform well. Manually crafting features is slow and fragile. UEFDS automates this process by:
- Growing a large pool of derived feature formulas from raw inputs.
- Evolving and scoring formulas generation by generation.
- Keeping stronger, less redundant feature survivors.
- Returning both an expandable candidate library and a final selected subset.
2) End-to-End Pipeline
flowchart TD
A[Raw X, y] --> B[Stage 1: Feature Discovery]
B --> C[All Generated Features]
C --> D[Hall of Fame Survivors]
D --> E[Compression: Keep Distinct Survivors]
E --> F[Candidate Feature Library for User]
F --> G[Stage 2: Final Survivor Subset]
G --> H[Best Subset + Final XGBoost Model]
Feature Evolution View (What Survives)
flowchart LR
R[Raw Features] --> G0[Gen 0: Random Features]
G0 --> G1[Gen 1: New Derived Features]
G1 --> G2[Gen 2: More Derived Features]
G2 --> HF[Hall of Fame Survivors]
HF --> CP[Compressed Survivors]
CP --> S2[Stage 2 Subset Survivors]
Think of UEFDS as survival-of-the-most-useful-features, not survival-of-operations.
3) Installation
Core + Dev + Visualization
pip install -e ".[dev,viz]"
Core dependencies include:
- numpy
- scikit-learn
- scipy
- xgboost
- shap
Optional extras:
- dev: pytest
- viz: matplotlib
4) Quick Start
import numpy as np
from sklearn.datasets import load_breast_cancer
from uefds import (
Stage1Config,
run_stage1,
build_candidate_library,
Stage2Config,
run_stage2,
)
# Load data
bundle = load_breast_cancer()
X, y = bundle.data, bundle.target
feature_names = list(bundle.feature_names)
# Stage 1: discover candidate formulas
stage1_result = run_stage1(
X,
y,
feature_names,
Stage1Config(
population_size=40,
n_generations=15,
task="classification",
protected_features=[0],
presence_quota_k=2,
seed=42,
),
)
print(stage1_result["hall_of_fame"].summary(n=5))
# Compress Hall of Fame into candidate library
candidates = build_candidate_library(
stage1_result["hall_of_fame"],
stage1_result["trees_by_id"],
X,
feature_names,
tau=0.9,
top_n=30,
)
X_candidates = np.column_stack([c.values for c in candidates])
# Stage 2: evolve best subset
stage2_result = run_stage2(
candidates,
X_candidates,
y,
Stage2Config(
population_size=20,
n_generations=10,
task="classification",
min_features=2,
max_features=min(10, len(candidates)),
cv_folds=3,
seed=42,
),
)
print("Best formulas:")
for f in stage2_result["best_formulas"]:
print(" ", f)
print("Final metrics:", stage2_result["final_metrics"])
5) Stage 1: Feature Discovery
What evolves in Stage 1
Each chromosome is one candidate feature formula (represented internally as an expression tree):
- Leaves reference raw feature indices.
- Internal structure defines how a new derived feature is computed.
- The output of each chromosome is a feature vector over all rows.
Fitness function
Stage 1 fitness is a weighted sum:
$$ F_i = w_Q Q_i + w_D D_i + w_S S_i $$
Where:
- $Q_i$: quality score (mutual information with target)
- $D_i$: diversity score ($1 - \max |corr|$ against already-kept features in evaluation pass)
- $S_i$: simplicity score from tree size
Default Stage 1 weights:
- quality: 0.5
- diversity: 0.3
- simplicity: 0.2
How feature survivors are chosen each generation
- Every generated feature gets a fitness score (quality + diversity + simplicity).
- Top features survive directly via elitism.
- Additional features are created from survivors and re-scored.
- Over generations, weak/redundant features are naturally filtered out.
Internal breeding operators exist, but from a user perspective the key output is the survivor feature pool, not the operator details.
Protection mechanism
You can protect important raw features by index using:
- protected_features
- presence_quota_k
UEFDS enforces a minimum population presence for protected features each generation.
Stage 1 output keys
run_stage1 returns:
- hall_of_fame: top formulas archive
- operator_learner: operator success tracker
- evolution_history: generation and lineage records
- final_population: last generation trees
- trees_by_id: dictionary of created trees by tree_id
- best_tree_id: best overall tree_id from Hall of Fame
If your goal is to create more features for downstream usage, focus on:
- hall_of_fame (best discovered formulas)
- trees_by_id (full feature objects)
- build_candidate_library(...) output (clean, reusable candidate features)
6) Compression Layer
Compression keeps only distinct feature survivors before Stage 2.
Method
- Evaluate selected Hall-of-Fame trees on X.
- Compute correlation matrix among candidate feature vectors.
- Convert to distance:
$$ \text{distance}(f_i, f_j) = 1 - |corr(f_i, f_j)| $$
- Perform hierarchical clustering.
- Keep one representative survivor per cluster (highest fitness).
Each compressed candidate stores:
- tree_id
- formula
- fitness
- values
- complexity (tree node count)
7) Stage 2: Feature Subset Selection
Stage 2 chromosome = a subset choice over compressed feature survivors.
Stage 2 fitness
Current Stage 2 fitness combines:
$$ F = w_P P + w_{St} St + w_{Sh} Sh + w_{Si} Si - w_C C $$
Where:
- $P$: cross-validated model performance (AUC for classification, $R^2$ for regression)
- $St$: fold stability term
- $Sh$: SHAP efficiency score
- $Si$: simplicity term derived from candidate complexity
- $C$: feature count penalty
Default Stage 2 weights:
- performance: 0.50
- stability: 0.15
- shap: 0.15
- simplicity: 0.10
- count_penalty: 0.10
Why this matters
- Performance ensures predictive power.
- Stability avoids fragile subsets.
- SHAP efficiency prefers non-redundant explanatory features.
- Simplicity favors less complex formulas.
- Count penalty discourages oversized subsets.
Stage 2 output keys
run_stage2 returns:
- selection_history: subset lineage records
- best_subset_id
- best_feature_indices
- best_formulas
- final_model (trained XGBoost model)
- final_metrics
8) Lineage and Interpretability
UEFDS tracks ancestry in both stages.
- Stage 1: trace feature formulas back to random-init ancestors.
- Stage 2: trace winning subset ancestry by parent subset IDs.
Examples:
stage1_result["evolution_history"].print_lineage(stage1_result["best_tree_id"])
stage2_result["selection_history"].print_lineage(stage2_result["best_subset_id"])
9) Visualization
Run the visualization demo:
python examples/demo_visualization.py
It generates figures in examples/viz_out for:
- best tree
- generation gallery
- fitness evolution
- feature presence curves
- lineage graph
10) Example Scripts
-
examples/demo_stage1.py
- Stage 1 only
- prints Hall of Fame and feature presence summary
-
examples/demo_full_pipeline.py
- Stage 1 + Compression + Stage 2
- prints final subset and final metrics
-
examples/demo_visualization.py
- Stage 1 with plotting pipeline
11) Running Tests
pytest -q
12) Repository Layout
- src/uefds/operators.py: operator library and operator learner
- src/uefds/tree.py: expression-tree chromosome and genetic ops
- src/uefds/fitness.py: Stage 1 scoring
- src/uefds/hall_of_fame.py: top-discovery archive
- src/uefds/evolution_history.py: lineage and generation logs
- src/uefds/stage1_discovery.py: Stage 1 loop
- src/uefds/compression.py: correlation clustering compression
- src/uefds/model_engine.py: XGBoost training and CV scoring
- src/uefds/shap_layer.py: SHAP importance and redundancy logic
- src/uefds/stage2_selection.py: Stage 2 loop
- src/uefds/visualization.py: plotting utilities
- src/uefds/init.py: package API exports
13) Practical Notes
- Seed all runs for reproducibility using Stage1Config.seed and Stage2Config.seed.
- Candidate count and Stage 2 population size strongly affect runtime.
- Start with small generation counts while iterating.
- Increase top_n and Stage 2 generations for stronger final search when ready.
14) Minimal Troubleshooting
- Import errors for xgboost/shap: install with pip install -e ".[dev,viz]".
- Empty candidate library: increase top_n, reduce compression strictness (lower tau), or increase Stage 1 generations.
- Very slow Stage 2: reduce population_size, n_generations, or cv_folds.
15) License
MIT
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file uefds-1.0.1.tar.gz.
File metadata
- Download URL: uefds-1.0.1.tar.gz
- Upload date:
- Size: 30.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
adb0a9d79213315e4aa99bfa8912df0ba9e4f1ace5600cb9d599245a563a918c
|
|
| MD5 |
af405f52b0a6ccc8fe8cb091b3e9795c
|
|
| BLAKE2b-256 |
394dab4b868121835598165e1c0ac74dcb936dae1dd744420c156b8d234cb0c4
|
File details
Details for the file uefds-1.0.1-py3-none-any.whl.
File metadata
- Download URL: uefds-1.0.1-py3-none-any.whl
- Upload date:
- Size: 30.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
9c2b3d0f91c71499f103966edae5814d926152e9a91e7fc151d82f3df6aa3f70
|
|
| MD5 |
b61c26c5475b8bdc23ab0ff8bc2c4d78
|
|
| BLAKE2b-256 |
b6cb01d20c3a2dcfed3956260b3ba15387be7c582f69186cb46bd81efada3195
|