Universal Evolutionary Feature Discovery and Selection Framework

These details have not been verified by PyPI

Project links

Repository

Project description

UEFDS: Universal Evolutionary Feature Discovery and Selection Framework

UEFDS is a feature-evolution pipeline for automated feature engineering:

Stage 1 creates many new feature formulas from your raw features.
Compression keeps only distinct survivors (removes near-duplicates).
Stage 2 chooses the final survivor subset for model training and usage.

The goal is to produce a large, useful feature library for users to reuse, then a compact final subset for deployment, with lineage tracking so every surviving feature can be traced back to its parents.

1) What Problem This Solves

Many datasets need feature engineering before models perform well. Manually crafting features is slow and fragile. UEFDS automates this process by:

Growing a large pool of derived feature formulas from raw inputs.
Evolving and scoring formulas generation by generation.
Keeping stronger, less redundant feature survivors.
Returning both an expandable candidate library and a final selected subset.

2) End-to-End Pipeline

flowchart TD
    A[Raw X, y] --> B[Stage 1: Feature Discovery]
    B --> C[All Generated Features]
    C --> D[Hall of Fame Survivors]
    D --> E[Compression: Keep Distinct Survivors]
    E --> F[Candidate Feature Library for User]
    F --> G[Stage 2: Final Survivor Subset]
    G --> H[Best Subset + Final XGBoost Model]

Feature Evolution View (What Survives)

flowchart LR
    R[Raw Features] --> G0[Gen 0: Random Features]
    G0 --> G1[Gen 1: New Derived Features]
    G1 --> G2[Gen 2: More Derived Features]
    G2 --> HF[Hall of Fame Survivors]
    HF --> CP[Compressed Survivors]
    CP --> S2[Stage 2 Subset Survivors]

Think of UEFDS as survival-of-the-most-useful-features, not survival-of-operations.

3) Installation

Core + Dev + Visualization

pip install -e ".[dev,viz]"

Core dependencies include:

numpy
scikit-learn
scipy
xgboost
shap

Optional extras:

dev: pytest
viz: matplotlib

4) Quick Start

import numpy as np
from sklearn.datasets import load_breast_cancer

from uefds import (
    Stage1Config,
    run_stage1,
    build_candidate_library,
    Stage2Config,
    run_stage2,
)

# Load data
bundle = load_breast_cancer()
X, y = bundle.data, bundle.target
feature_names = list(bundle.feature_names)

# Stage 1: discover candidate formulas
stage1_result = run_stage1(
    X,
    y,
    feature_names,
    Stage1Config(
        population_size=40,
        n_generations=15,
        task="classification",
        protected_features=[0],
        presence_quota_k=2,
        seed=42,
    ),
)

print(stage1_result["hall_of_fame"].summary(n=5))

# Compress Hall of Fame into candidate library
candidates = build_candidate_library(
    stage1_result["hall_of_fame"],
    stage1_result["trees_by_id"],
    X,
    feature_names,
    tau=0.9,
    top_n=30,
)

X_candidates = np.column_stack([c.values for c in candidates])

# Stage 2: evolve best subset
stage2_result = run_stage2(
    candidates,
    X_candidates,
    y,
    Stage2Config(
        population_size=20,
        n_generations=10,
        task="classification",
        min_features=2,
        max_features=min(10, len(candidates)),
        cv_folds=3,
        seed=42,
    ),
)

print("Best formulas:")
for f in stage2_result["best_formulas"]:
    print(" ", f)

print("Final metrics:", stage2_result["final_metrics"])

5) Stage 1: Feature Discovery

What evolves in Stage 1

Each chromosome is one candidate feature formula (represented internally as an expression tree):

Leaves reference raw feature indices.
Internal structure defines how a new derived feature is computed.
The output of each chromosome is a feature vector over all rows.

Fitness function

Stage 1 fitness is a weighted sum:

$$ F_i = w_Q Q_i + w_D D_i + w_S S_i $$

Where:

$Q_i$: quality score (mutual information with target)
$D_i$: diversity score ($1 - \max |corr|$ against already-kept features in evaluation pass)
$S_i$: simplicity score from tree size

Default Stage 1 weights:

quality: 0.5
diversity: 0.3
simplicity: 0.2

How feature survivors are chosen each generation

Every generated feature gets a fitness score (quality + diversity + simplicity).
Top features survive directly via elitism.
Additional features are created from survivors and re-scored.
Over generations, weak/redundant features are naturally filtered out.

Internal breeding operators exist, but from a user perspective the key output is the survivor feature pool, not the operator details.

Protection mechanism

You can protect important raw features by index using:

protected_features
presence_quota_k

UEFDS enforces a minimum population presence for protected features each generation.

Stage 1 output keys

run_stage1 returns:

hall_of_fame: top formulas archive
operator_learner: operator success tracker
evolution_history: generation and lineage records
final_population: last generation trees
trees_by_id: dictionary of created trees by tree_id
best_tree_id: best overall tree_id from Hall of Fame

If your goal is to create more features for downstream usage, focus on:

hall_of_fame (best discovered formulas)
trees_by_id (full feature objects)
build_candidate_library(...) output (clean, reusable candidate features)

6) Compression Layer

Compression keeps only distinct feature survivors before Stage 2.

Method

Evaluate selected Hall-of-Fame trees on X.
Compute correlation matrix among candidate feature vectors.
Convert to distance:

$$ \text{distance}(f_i, f_j) = 1 - |corr(f_i, f_j)| $$

Perform hierarchical clustering.
Keep one representative survivor per cluster (highest fitness).

Each compressed candidate stores:

tree_id
formula
fitness
values
complexity (tree node count)

7) Stage 2: Feature Subset Selection

Stage 2 chromosome = a subset choice over compressed feature survivors.

Stage 2 fitness

Current Stage 2 fitness combines:

$$ F = w_P P + w_{St} St + w_{Sh} Sh + w_{Si} Si - w_C C $$

Where:

$P$: cross-validated model performance (AUC for classification, $R^2$ for regression)
$St$: fold stability term
$Sh$: SHAP efficiency score
$Si$: simplicity term derived from candidate complexity
$C$: feature count penalty

Default Stage 2 weights:

performance: 0.50
stability: 0.15
shap: 0.15
simplicity: 0.10
count_penalty: 0.10

Why this matters

Performance ensures predictive power.
Stability avoids fragile subsets.
SHAP efficiency prefers non-redundant explanatory features.
Simplicity favors less complex formulas.
Count penalty discourages oversized subsets.

Stage 2 output keys

run_stage2 returns:

selection_history: subset lineage records
best_subset_id
best_feature_indices
best_formulas
final_model (trained XGBoost model)
final_metrics

8) Lineage and Interpretability

UEFDS tracks ancestry in both stages.

Stage 1: trace feature formulas back to random-init ancestors.
Stage 2: trace winning subset ancestry by parent subset IDs.

Examples:

stage1_result["evolution_history"].print_lineage(stage1_result["best_tree_id"])
stage2_result["selection_history"].print_lineage(stage2_result["best_subset_id"])

9) Visualization

Run the visualization demo:

python examples/demo_visualization.py

It generates figures in examples/viz_out for:

best tree
generation gallery
fitness evolution
feature presence curves
lineage graph

10) Example Scripts

examples/demo_stage1.py
- Stage 1 only
- prints Hall of Fame and feature presence summary
examples/demo_full_pipeline.py
- Stage 1 + Compression + Stage 2
- prints final subset and final metrics
examples/demo_visualization.py
- Stage 1 with plotting pipeline

11) Running Tests

pytest -q

12) Repository Layout

src/uefds/operators.py: operator library and operator learner
src/uefds/tree.py: expression-tree chromosome and genetic ops
src/uefds/fitness.py: Stage 1 scoring
src/uefds/hall_of_fame.py: top-discovery archive
src/uefds/evolution_history.py: lineage and generation logs
src/uefds/stage1_discovery.py: Stage 1 loop
src/uefds/compression.py: correlation clustering compression
src/uefds/model_engine.py: XGBoost training and CV scoring
src/uefds/shap_layer.py: SHAP importance and redundancy logic
src/uefds/stage2_selection.py: Stage 2 loop
src/uefds/visualization.py: plotting utilities
src/uefds/init.py: package API exports

13) Practical Notes

Seed all runs for reproducibility using Stage1Config.seed and Stage2Config.seed.
Candidate count and Stage 2 population size strongly affect runtime.
Start with small generation counts while iterating.
Increase top_n and Stage 2 generations for stronger final search when ready.

14) Minimal Troubleshooting

Import errors for xgboost/shap: install with pip install -e ".[dev,viz]".
Empty candidate library: increase top_n, reduce compression strictness (lower tau), or increase Stage 1 generations.
Very slow Stage 2: reduce population_size, n_generations, or cv_folds.

15) License

MIT

Project details

These details have not been verified by PyPI

Project links

Repository

Release history Release notifications | RSS feed

This version

1.0.1

Jul 3, 2026

0.1.0

Jul 2, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

uefds-1.0.1.tar.gz (30.2 kB view details)

Uploaded Jul 3, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

uefds-1.0.1-py3-none-any.whl (30.9 kB view details)

Uploaded Jul 3, 2026 Python 3

File details

Details for the file uefds-1.0.1.tar.gz.

File metadata

Download URL: uefds-1.0.1.tar.gz
Upload date: Jul 3, 2026
Size: 30.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for uefds-1.0.1.tar.gz
Algorithm	Hash digest
SHA256	`adb0a9d79213315e4aa99bfa8912df0ba9e4f1ace5600cb9d599245a563a918c`
MD5	`af405f52b0a6ccc8fe8cb091b3e9795c`
BLAKE2b-256	`394dab4b868121835598165e1c0ac74dcb936dae1dd744420c156b8d234cb0c4`

See more details on using hashes here.

File details

Details for the file uefds-1.0.1-py3-none-any.whl.

File metadata

Download URL: uefds-1.0.1-py3-none-any.whl
Upload date: Jul 3, 2026
Size: 30.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for uefds-1.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9c2b3d0f91c71499f103966edae5814d926152e9a91e7fc151d82f3df6aa3f70`
MD5	`b61c26c5475b8bdc23ab0ff8bc2c4d78`
BLAKE2b-256	`b6cb01d20c3a2dcfed3956260b3ba15387be7c582f69186cb46bd81efada3195`

See more details on using hashes here.

uefds 1.0.1

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Project description

UEFDS: Universal Evolutionary Feature Discovery and Selection Framework

1) What Problem This Solves

2) End-to-End Pipeline

Feature Evolution View (What Survives)

3) Installation

Core + Dev + Visualization

4) Quick Start

5) Stage 1: Feature Discovery

What evolves in Stage 1

Fitness function

How feature survivors are chosen each generation

Protection mechanism

Stage 1 output keys

6) Compression Layer

Method

7) Stage 2: Feature Subset Selection

Stage 2 fitness

Why this matters

Stage 2 output keys

8) Lineage and Interpretability

9) Visualization

10) Example Scripts

11) Running Tests

12) Repository Layout

13) Practical Notes

14) Minimal Troubleshooting

15) License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes