Skip to main content

Universal Evolutionary Feature Discovery and Selection Framework

Project description

UEFDS: Universal Evolutionary Feature Discovery and Selection Framework

UEFDS is a feature-evolution pipeline for automated feature engineering:

  1. Stage 1 creates many new feature formulas from your raw features.
  2. Compression keeps only distinct survivors (removes near-duplicates).
  3. Stage 2 chooses the final survivor subset for model training and usage.

The goal is to produce a large, useful feature library for users to reuse, then a compact final subset for deployment, with lineage tracking so every surviving feature can be traced back to its parents.

1) What Problem This Solves

Many datasets need feature engineering before models perform well. Manually crafting features is slow and fragile. UEFDS automates this process by:

  • Growing a large pool of derived feature formulas from raw inputs.
  • Evolving and scoring formulas generation by generation.
  • Keeping stronger, less redundant feature survivors.
  • Returning both an expandable candidate library and a final selected subset.

2) End-to-End Pipeline

flowchart TD
    A[Raw X, y] --> B[Stage 1: Feature Discovery]
    B --> C[All Generated Features]
    C --> D[Hall of Fame Survivors]
    D --> E[Compression: Keep Distinct Survivors]
    E --> F[Candidate Feature Library for User]
    F --> G[Stage 2: Final Survivor Subset]
    G --> H[Best Subset + Final XGBoost Model]

Feature Evolution View (What Survives)

flowchart LR
    R[Raw Features] --> G0[Gen 0: Random Features]
    G0 --> G1[Gen 1: New Derived Features]
    G1 --> G2[Gen 2: More Derived Features]
    G2 --> HF[Hall of Fame Survivors]
    HF --> CP[Compressed Survivors]
    CP --> S2[Stage 2 Subset Survivors]

Think of UEFDS as survival-of-the-most-useful-features, not survival-of-operations.

3) Installation

Core + Dev + Visualization

pip install -e ".[dev,viz]"

Core dependencies include:

  • numpy
  • scikit-learn
  • scipy
  • xgboost
  • shap

Optional extras:

  • dev: pytest
  • viz: matplotlib

4) Quick Start

import numpy as np
from sklearn.datasets import load_breast_cancer

from uefds import (
    Stage1Config,
    run_stage1,
    build_candidate_library,
    Stage2Config,
    run_stage2,
)

# Load data
bundle = load_breast_cancer()
X, y = bundle.data, bundle.target
feature_names = list(bundle.feature_names)

# Stage 1: discover candidate formulas
stage1_result = run_stage1(
    X,
    y,
    feature_names,
    Stage1Config(
        population_size=40,
        n_generations=15,
        task="classification",
        protected_features=[0],
        presence_quota_k=2,
        seed=42,
    ),
)

print(stage1_result["hall_of_fame"].summary(n=5))

# Compress Hall of Fame into candidate library
candidates = build_candidate_library(
    stage1_result["hall_of_fame"],
    stage1_result["trees_by_id"],
    X,
    feature_names,
    tau=0.9,
    top_n=30,
)

X_candidates = np.column_stack([c.values for c in candidates])

# Stage 2: evolve best subset
stage2_result = run_stage2(
    candidates,
    X_candidates,
    y,
    Stage2Config(
        population_size=20,
        n_generations=10,
        task="classification",
        min_features=2,
        max_features=min(10, len(candidates)),
        cv_folds=3,
        seed=42,
    ),
)

print("Best formulas:")
for f in stage2_result["best_formulas"]:
    print(" ", f)

print("Final metrics:", stage2_result["final_metrics"])

5) Stage 1: Feature Discovery

What evolves in Stage 1

Each chromosome is one candidate feature formula (represented internally as an expression tree):

  • Leaves reference raw feature indices.
  • Internal structure defines how a new derived feature is computed.
  • The output of each chromosome is a feature vector over all rows.

Fitness function

Stage 1 fitness is a weighted sum:

$$ F_i = w_Q Q_i + w_D D_i + w_S S_i $$

Where:

  • $Q_i$: quality score (mutual information with target)
  • $D_i$: diversity score ($1 - \max |corr|$ against already-kept features in evaluation pass)
  • $S_i$: simplicity score from tree size

Default Stage 1 weights:

  • quality: 0.5
  • diversity: 0.3
  • simplicity: 0.2

How feature survivors are chosen each generation

  • Every generated feature gets a fitness score (quality + diversity + simplicity).
  • Top features survive directly via elitism.
  • Additional features are created from survivors and re-scored.
  • Over generations, weak/redundant features are naturally filtered out.

Internal breeding operators exist, but from a user perspective the key output is the survivor feature pool, not the operator details.

Protection mechanism

You can protect important raw features by index using:

  • protected_features
  • presence_quota_k

UEFDS enforces a minimum population presence for protected features each generation.

Stage 1 output keys

run_stage1 returns:

  • hall_of_fame: top formulas archive
  • operator_learner: operator success tracker
  • evolution_history: generation and lineage records
  • final_population: last generation trees
  • trees_by_id: dictionary of created trees by tree_id
  • best_tree_id: best overall tree_id from Hall of Fame

If your goal is to create more features for downstream usage, focus on:

  • hall_of_fame (best discovered formulas)
  • trees_by_id (full feature objects)
  • build_candidate_library(...) output (clean, reusable candidate features)

6) Compression Layer

Compression keeps only distinct feature survivors before Stage 2.

Method

  1. Evaluate selected Hall-of-Fame trees on X.
  2. Compute correlation matrix among candidate feature vectors.
  3. Convert to distance:

$$ \text{distance}(f_i, f_j) = 1 - |corr(f_i, f_j)| $$

  1. Perform hierarchical clustering.
  2. Keep one representative survivor per cluster (highest fitness).

Each compressed candidate stores:

  • tree_id
  • formula
  • fitness
  • values
  • complexity (tree node count)

7) Stage 2: Feature Subset Selection

Stage 2 chromosome = a subset choice over compressed feature survivors.

Stage 2 fitness

Current Stage 2 fitness combines:

$$ F = w_P P + w_{St} St + w_{Sh} Sh + w_{Si} Si - w_C C $$

Where:

  • $P$: cross-validated model performance (AUC for classification, $R^2$ for regression)
  • $St$: fold stability term
  • $Sh$: SHAP efficiency score
  • $Si$: simplicity term derived from candidate complexity
  • $C$: feature count penalty

Default Stage 2 weights:

  • performance: 0.50
  • stability: 0.15
  • shap: 0.15
  • simplicity: 0.10
  • count_penalty: 0.10

Why this matters

  • Performance ensures predictive power.
  • Stability avoids fragile subsets.
  • SHAP efficiency prefers non-redundant explanatory features.
  • Simplicity favors less complex formulas.
  • Count penalty discourages oversized subsets.

Stage 2 output keys

run_stage2 returns:

  • selection_history: subset lineage records
  • best_subset_id
  • best_feature_indices
  • best_formulas
  • final_model (trained XGBoost model)
  • final_metrics

8) Lineage and Interpretability

UEFDS tracks ancestry in both stages.

  • Stage 1: trace feature formulas back to random-init ancestors.
  • Stage 2: trace winning subset ancestry by parent subset IDs.

Examples:

stage1_result["evolution_history"].print_lineage(stage1_result["best_tree_id"])
stage2_result["selection_history"].print_lineage(stage2_result["best_subset_id"])

9) Visualization

Run the visualization demo:

python examples/demo_visualization.py

It generates figures in examples/viz_out for:

  • best tree
  • generation gallery
  • fitness evolution
  • feature presence curves
  • lineage graph

10) Example Scripts

  • examples/demo_stage1.py

    • Stage 1 only
    • prints Hall of Fame and feature presence summary
  • examples/demo_full_pipeline.py

    • Stage 1 + Compression + Stage 2
    • prints final subset and final metrics
  • examples/demo_visualization.py

    • Stage 1 with plotting pipeline

11) Running Tests

pytest -q

12) Repository Layout

  • src/uefds/operators.py: operator library and operator learner
  • src/uefds/tree.py: expression-tree chromosome and genetic ops
  • src/uefds/fitness.py: Stage 1 scoring
  • src/uefds/hall_of_fame.py: top-discovery archive
  • src/uefds/evolution_history.py: lineage and generation logs
  • src/uefds/stage1_discovery.py: Stage 1 loop
  • src/uefds/compression.py: correlation clustering compression
  • src/uefds/model_engine.py: XGBoost training and CV scoring
  • src/uefds/shap_layer.py: SHAP importance and redundancy logic
  • src/uefds/stage2_selection.py: Stage 2 loop
  • src/uefds/visualization.py: plotting utilities
  • src/uefds/init.py: package API exports

13) Practical Notes

  • Seed all runs for reproducibility using Stage1Config.seed and Stage2Config.seed.
  • Candidate count and Stage 2 population size strongly affect runtime.
  • Start with small generation counts while iterating.
  • Increase top_n and Stage 2 generations for stronger final search when ready.

14) Minimal Troubleshooting

  • Import errors for xgboost/shap: install with pip install -e ".[dev,viz]".
  • Empty candidate library: increase top_n, reduce compression strictness (lower tau), or increase Stage 1 generations.
  • Very slow Stage 2: reduce population_size, n_generations, or cv_folds.

15) License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

uefds-1.0.1.tar.gz (30.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

uefds-1.0.1-py3-none-any.whl (30.9 kB view details)

Uploaded Python 3

File details

Details for the file uefds-1.0.1.tar.gz.

File metadata

  • Download URL: uefds-1.0.1.tar.gz
  • Upload date:
  • Size: 30.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for uefds-1.0.1.tar.gz
Algorithm Hash digest
SHA256 adb0a9d79213315e4aa99bfa8912df0ba9e4f1ace5600cb9d599245a563a918c
MD5 af405f52b0a6ccc8fe8cb091b3e9795c
BLAKE2b-256 394dab4b868121835598165e1c0ac74dcb936dae1dd744420c156b8d234cb0c4

See more details on using hashes here.

File details

Details for the file uefds-1.0.1-py3-none-any.whl.

File metadata

  • Download URL: uefds-1.0.1-py3-none-any.whl
  • Upload date:
  • Size: 30.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for uefds-1.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 9c2b3d0f91c71499f103966edae5814d926152e9a91e7fc151d82f3df6aa3f70
MD5 b61c26c5475b8bdc23ab0ff8bc2c4d78
BLAKE2b-256 b6cb01d20c3a2dcfed3956260b3ba15387be7c582f69186cb46bd81efada3195

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page