
Hyperparameter tuning and feature selection for scikit-learn models using evolutionary algorithms

sklearn-genetic-opt

sklearn-genetic-opt adds evolutionary optimization tools to the scikit-learn workflow. It can tune hyperparameters with GASearchCV and select feature subsets with GAFeatureSelectionCV using algorithms powered by DEAP.

The project is useful when a search space is mixed, irregular, expensive, or not well served by an exhaustive grid. It follows familiar scikit-learn patterns: define an estimator, define a search space, call fit, inspect best_params_ or support_, and use the fitted object for prediction.

Documentation is available at rodrigo-arenas.github.io/Sklearn-genetic-opt.

Highlights

  • GASearchCV for hyperparameter search across classification, regression, and supported outlier-detection estimators.

  • GAFeatureSelectionCV for wrapper-based feature selection with cross-validation.

  • Search spaces for integer, continuous, and categorical parameters.

  • Smart initial populations with PopulationConfig(initializer="smart"), including warm-start seeds, estimator defaults, Latin-hypercube numeric coverage, stratified categorical coverage, and duplicate avoidance.

  • Adaptive mutation and crossover schedules.

  • Optional local search, diversity control, random immigrants, and fitness sharing to improve exploration, avoid premature convergence, and refine good solutions.

  • Parallel candidate evaluation with n_jobs and parallel_backend.

  • Evaluation caching, optimizer telemetry through history, and fit-cost counters through fit_stats_.

  • Callbacks for early stopping, progress reporting, checkpoints, TensorBoard, and custom logic.

  • Plotting helpers plus MLflow 3 logging support.
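The evaluation caching and duplicate avoidance mentioned above can be pictured with a small standalone sketch. It is not the library's implementation: `CachedEvaluator`, `params_key`, and `toy_score` are hypothetical names, and the toy score stands in for a real cross-validated fit.

```python
from typing import Callable, Dict, Hashable, Tuple


def params_key(params: dict) -> Tuple[Tuple[str, Hashable], ...]:
    """Build an order-independent, hashable key for a parameter dict."""
    return tuple(sorted(params.items()))


class CachedEvaluator:
    """Wrap a scoring function and count cache hits and real evaluations."""

    def __init__(self, score_fn: Callable[[dict], float]) -> None:
        self.score_fn = score_fn
        self.cache: Dict[tuple, float] = {}
        self.hits = 0
        self.evaluations = 0

    def __call__(self, params: dict) -> float:
        key = params_key(params)
        if key in self.cache:
            # Candidate already scored: reuse the stored result.
            self.hits += 1
            return self.cache[key]
        self.evaluations += 1
        score = self.score_fn(params)
        self.cache[key] = score
        return score


def toy_score(params: dict) -> float:
    # Hypothetical stand-in for a cross-validated score.
    return -abs(params["max_depth"] - 5) - abs(params["C"] - 1.0)


evaluator = CachedEvaluator(toy_score)
candidates = [
    {"max_depth": 3, "C": 1.0},
    {"C": 1.0, "max_depth": 3},  # same candidate, different key order
    {"max_depth": 5, "C": 0.5},
]
scores = [evaluator(c) for c in candidates]
print(scores, evaluator.evaluations, evaluator.hits)
```

In this sketch the second candidate is served from the cache, so only two real evaluations happen for three requests. That is the kind of saving the fit-cost counters are meant to expose.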

Installation

Install the core package with pip:

pip install sklearn-genetic-opt

Or with conda from the conda-forge channel:

conda install -c conda-forge sklearn-genetic-opt

Install optional plotting, MLflow, and TensorBoard integrations with pip:

pip install "sklearn-genetic-opt[all]"

The conda package ships only the core dependencies. To use the optional integrations in a conda environment, install the extras you need alongside it, for example:

conda install -c conda-forge sklearn-genetic-opt seaborn mlflow

Requirements

Core requirements:

  • Python >= 3.12

  • scikit-learn >= 1.5.0

  • NumPy >= 1.26.0

  • DEAP >= 1.4.0

  • tqdm >= 4.60.0

Optional extras:

  • Seaborn >= 0.13.2 for plots

  • MLflow >= 3.14.0 for experiment logging

  • TensorFlow >= 2.21.0 and TensorBoard >= 2.20.0,<2.21.0 for TensorBoard logging on Python < 3.14

Quick Start: Feature Selection

import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

from sklearn_genetic import (
    EvolutionConfig,
    GAFeatureSelectionCV,
    PopulationConfig,
    RuntimeConfig,
)
from sklearn_genetic.schedules import ExponentialAdapter

X, y = load_iris(return_X_y=True)
noise = np.random.default_rng(42).uniform(0, 1, size=(X.shape[0], 8))
X = np.hstack([X, noise])

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    stratify=y,
    random_state=42,
)

selector = GAFeatureSelectionCV(
    estimator=SVC(gamma="auto"),
    cv=3,
    scoring="accuracy",
    evolution_config=EvolutionConfig(
        population_size=12,
        generations=8,
        mutation_probability=ExponentialAdapter(0.8, 0.2, 0.05),
        crossover_probability=ExponentialAdapter(0.2, 0.8, 0.05),
    ),
    population_config=PopulationConfig(initializer="smart"),
    runtime_config=RuntimeConfig(n_jobs=-1, verbose=True),
)

selector.fit(X_train, y_train)

print(selector.support_)
print(accuracy_score(y_test, selector.predict(X_test)))
print(selector.transform(X_test).shape)
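To see what the two ExponentialAdapter schedules in the example do over the run, here is a minimal standalone sketch. It assumes the three arguments are the initial value, the end value, and the decay rate, and that the value follows `(initial - end) * exp(-rate * generation) + end`. Both the argument order and the formula are assumptions here, and `exponential_schedule` is a hypothetical helper, not part of the package.

```python
import math


def exponential_schedule(initial: float, end: float, rate: float, generation: int) -> float:
    """Move from `initial` toward `end` exponentially as generations advance."""
    return (initial - end) * math.exp(-rate * generation) + end


for generation in range(0, 9, 2):
    # Same numbers as the quick start: mutation decays, crossover grows.
    mutation = exponential_schedule(0.8, 0.2, 0.05, generation)
    crossover = exponential_schedule(0.2, 0.8, 0.05, generation)
    print(f"gen {generation}: mutation={mutation:.3f} crossover={crossover:.3f}")
```

Under these assumptions the search starts with heavy mutation for exploration and shifts toward crossover later. Because the two schedules mirror each other, their sum stays at 1.0 in every generation.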

Improving Search Quality

The default PopulationConfig(initializer="smart") is recommended for most runs. It improves early search coverage without reducing the number of generations or population size.
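As an illustration of why Latin-hypercube coverage helps a small initial population, the sketch below draws one point per equal-width stratum in every numeric dimension. It is a generic version of the technique, not the package's initializer. `latin_hypercube` and the bounds for `C` and `gamma` are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def latin_hypercube(
    bounds: Dict[str, Tuple[float, float]], n_samples: int, seed: int = 0
) -> List[Dict[str, float]]:
    """Draw n_samples points so each dimension has one point per equal-width stratum."""
    rng = random.Random(seed)
    columns = {}
    for name, (low, high) in bounds.items():
        width = (high - low) / n_samples
        # One uniform draw inside each stratum, then shuffle which
        # individual receives which stratum.
        values = [low + (i + rng.random()) * width for i in range(n_samples)]
        rng.shuffle(values)
        columns[name] = values
    return [{name: columns[name][i] for name in bounds} for i in range(n_samples)]


population = latin_hypercube({"C": (0.1, 10.0), "gamma": (0.001, 1.0)}, n_samples=6)
for individual in population:
    print(individual)
```

With purely random sampling, six individuals can easily cluster in one corner of the space. Here every sixth of each parameter range is guaranteed to contain exactly one individual.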

For harder spaces, combine it with optimizer controls:

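from sklearn_genetic import (
    EvolutionConfig,
    GASearchCV,
    OptimizationConfig,
    PopulationConfig,
    RuntimeConfig,
)
from sklearn_genetic.schedules import ExponentialAdapter, InverseAdapter

# `estimator` and `param_grid` are assumed to be defined earlier, with
# `param_grid` built from sklearn_genetic.space objects.
search = GASearchCV(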
    estimator=estimator,
    param_grid=param_grid,
    scoring="roc_auc",
    cv=3,
    evolution_config=EvolutionConfig(
        population_size=16,
        generations=12,
        crossover_probability=ExponentialAdapter(0.85, 0.45, 0.08),
        mutation_probability=InverseAdapter(0.18, 0.55, 0.12),
    ),
    population_config=PopulationConfig(
        initializer="smart",
        warm_start_configs=[{"C": 1.0, "class_weight": None}],
    ),
    runtime_config=RuntimeConfig(n_jobs=-1, parallel_backend="auto", use_cache=True),
    optimization_config=OptimizationConfig(
        local_search=True,
        local_search_top_k=2,
        local_search_steps=2,
        diversity_control=True,
        random_immigrants_fraction=0.15,
        fitness_sharing=True,
    ),
)

Use:

  • PopulationConfig.warm_start_configs when you already know useful candidate settings.

  • OptimizationConfig(local_search=True) to refine the best candidates near the end of the run.

  • OptimizationConfig(diversity_control=True) to react when the population collapses too early.

  • OptimizationConfig(fitness_sharing=True) to keep multiple promising regions alive.

  • fit_stats_ to understand evaluation cost, cache hits, and skipped work.

  • history to inspect fitness, diversity, stagnation, and optimizer telemetry by generation.
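The idea behind fitness sharing can be shown with the classic niche-count formulation: each raw fitness is divided by a count of how crowded its neighbourhood is. This is a textbook sketch, not necessarily what `OptimizationConfig(fitness_sharing=True)` does internally. The Hamming distance, `sigma`, `alpha`, and the example masks are all assumptions chosen for illustration, and it assumes positive fitness values that are being maximized.

```python
from typing import List, Sequence


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Count positions where two equal-length masks differ."""
    return sum(x != y for x, y in zip(a, b))


def shared_fitness(
    population: List[Sequence[int]],
    fitness: List[float],
    sigma: float = 2.0,
    alpha: float = 1.0,
) -> List[float]:
    """Divide each raw fitness by its niche count (classic fitness sharing)."""
    shared = []
    for i, individual in enumerate(population):
        niche_count = 0.0
        for other in population:
            distance = hamming_distance(individual, other)
            if distance < sigma:
                # Closer neighbours contribute more to the niche count;
                # the individual itself always contributes 1.0.
                niche_count += 1.0 - (distance / sigma) ** alpha
        shared.append(fitness[i] / niche_count)
    return shared


masks = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
]
raw = [0.90, 0.90, 0.88, 0.85]
shared = shared_fitness(masks, raw)
print(shared)
```

The three near-identical masks split their fitness among themselves, while the isolated fourth mask keeps its full score and ends up ranked first. That is how sharing keeps a second promising region alive.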

Parallelism

RuntimeConfig.n_jobs controls parallel execution:

  • RuntimeConfig(n_jobs=1) runs sequentially.

  • RuntimeConfig(n_jobs=-1) uses all available CPU cores.

  • RuntimeConfig(n_jobs=k) uses k workers.

With RuntimeConfig(parallel_backend="auto") or RuntimeConfig(parallel_backend="population"), unique candidates in the same generation are evaluated in parallel, and each candidate runs its cross-validation sequentially to avoid nested parallelism. Use RuntimeConfig(parallel_backend="cv") to evaluate candidates one at a time while passing n_jobs to each candidate’s cross-validation call.
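The population-level strategy can be sketched with the standard library alone. The example uses a thread pool purely to stay self-contained. The package's actual backend is not shown here, and `score_fold`, `cross_validate_sequential`, and `evaluate_generation` are hypothetical helpers.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def score_fold(candidate: Dict[str, float], fold: int) -> float:
    # Hypothetical stand-in for fitting and scoring one cross-validation fold.
    return 1.0 - abs(candidate["C"] - 1.0) - 0.01 * fold


def cross_validate_sequential(candidate: Dict[str, float], n_folds: int = 3) -> float:
    """Score all folds of one candidate one after another."""
    return sum(score_fold(candidate, fold) for fold in range(n_folds)) / n_folds


def evaluate_generation(candidates: List[Dict[str, float]], n_workers: int) -> List[float]:
    """Population-level parallelism: one worker per candidate, folds run serially."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map preserves the input order of the candidates.
        return list(pool.map(cross_validate_sequential, candidates))


generation = [{"C": 0.5}, {"C": 1.0}, {"C": 2.0}]
results = evaluate_generation(generation, n_workers=3)
print(results)
```

Only the outer loop over candidates is parallel, so the number of busy workers never exceeds `n_workers`. A "cv"-style strategy would invert this: a plain loop over candidates, with the folds of each candidate spread across workers.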

Persistence and Checkpointing

Use ModelCheckpoint to write progress during long searches:

from sklearn_genetic.callbacks import ModelCheckpoint

search.fit(X_train, y_train, callbacks=[ModelCheckpoint("checkpoint.pkl")])

Use save and load when you want to persist a fitted search object:

search.save("ga_search.pkl")
restored = GASearchCV(estimator=estimator, param_grid=param_grid)
restored.load("ga_search.pkl")

Benchmarks

The repository includes benchmark scripts for optimizer mechanics, model metrics, and comparisons against scikit-learn search methods:

python benchmarks/benchmark_fit.py --quick
python benchmarks/benchmark_fit.py --parallel-backends auto cv --runs 3
python benchmarks/benchmark_search_methods.py --quick
python benchmarks/benchmark_search_methods.py --methods gasearch randomized grid --runs 3

The reports include runtime, evaluated candidates, cross-validation effort, cache/duplicate counts, optimizer telemetry, holdout metrics, and best parameters.

Documentation and Examples

The documentation includes tutorials and executed notebooks for:

  • comparing GASearchCV with sklearn search methods

  • pipeline tuning and prediction

  • feature selection

  • multi-metric optimization

  • MLflow 3 logging

  • checkpointing and persistence

  • advanced optimizer controls

Troubleshooting

TypeError: param_grid values must be instances of Integer, Continuous or Categorical

Use sklearn_genetic.space objects instead of plain lists.

from sklearn_genetic.space import Integer

param_grid = {"max_depth": Integer(2, 10)}

Missing optional dependencies

Install the optional extras:

pip install "sklearn-genetic-opt[all]"

TensorFlow, TensorBoard, or MLflow dependency conflicts

Install only the extras you need, or use a clean virtual environment. TensorFlow/TensorBoard support is only available on Python versions supported by those projects.

Contributing

Contributions are welcome. Please read the contribution guide, open issues for bugs or proposals, and include tests and documentation when changing behavior.

For local development:

git clone https://github.com/rodrigo-arenas/Sklearn-genetic-opt.git
cd Sklearn-genetic-opt
pip install -r dev-requirements.txt
pytest sklearn_genetic

Big thanks to everyone helping improve the project.

