Skip to main content

A Python package for generation and optimization of machine learning pipelines using an evolutionary algorithm

Project description

PipeGenie

PipeGenie

Table of Contents

Introduction

PipeGenie is a powerful Python tool for automating the generation and optimization of machine learning pipelines. By leveraging grammar-guided genetic programming (G3P), PipeGenie enables both novices and experts to efficiently create robust machine learning workflows. The intuitive design ensures compatibility with the widely-used scikit-learn library and its pipeline API, making it accessible for a broad audience.

Whether you're a data scientist looking to streamline your workflow or a developer exploring machine learning, PipeGenie simplifies the process of pipeline optimization, allowing you to focus on achieving the best results. With support for both classification and regression tasks, PipeGenie is a versatile tool that can help you tackle a wide range of machine learning challenges.

Features

  • Automated Pipeline Generation: Simplifies the creation of machine learning pipelines.
  • Grammar-Guided Genetic Programming: Uses a set of predefined rules to evolve pipelines efficiently.
  • Compatibility: Seamlessly integrates with scikit-learn.
  • User-Friendly: Minimal machine learning knowledge required to get started.
  • Support for Classification and Regression: Capable of handling both types of tasks.
  • Future Plans: Support for time series.

Installation

Requirements

  • Windows: Python ≥ 3.10, < 3.13
  • macOS/Linux: Python ≥ 3.10, < 3.13
  • pip and setuptools should be up to date:
python -m pip install --upgrade pip setuptools

Installing the Package

Before installing PipeGenie, it is recommended to create a virtual environment to avoid conflicts with other packages. To create a virtual environment, you can use the following command:

python -m venv pipegenie-env

After creating the virtual environment, you can activate it using the following command:

source pipegenie-env/bin/activate

You can install PipeGenie using pip:

pip install pipegenie

You can also install PipeGenie from source by cloning the repository, navigating to the root directory of the repository, and running:

pip install .

or installing it in editable mode:

pip install -e .

If you see a Python version error, ensure your system meets the minimum requirements.

Usage

NOTE: for Windows compatibility, you need to include multiprocessing.freeze_support() within a conditional block that checks if the script is being run as the main program, and specifically call it at the very beginning of that block. However, PipeGenie uses multiprocessing to speed up the evolutionary process and an evaluation timeout to avoid long-running pipelines. When the evaluation timeout is reached, the process is terminated, and the pipeline is considered invalid. While completing the task and generating all related files, Windows has some issues terminating this process, raising warnings and errors every time it tries to do so.

...
from sklearn.datasets import load_iris
import multiprocessing # Import the multiprocessing module

if __name__ == '__main__':
    # Call freeze_support() only if running on Windows
    multiprocessing.freeze_support()

    # Load the dataset
    X, y = load_iris(return_X_y=True)

    # Split the dataset
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
    ...

Classification Example

You can try the following code to generate a model for a classification task using the PipegenieClassifier class:

from pipegenie.classification import PipegenieClassifier
from pipegenie.model_selection import train_test_split
from pipegenie.metrics import balanced_accuracy_score
from sklearn.datasets import load_iris

# Load the dataset
X, y = load_iris(return_X_y=True)

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Define the grammar file
grammar_file = "tutorials/sample-grammar-classification.xml"

# Create an instance of the PipegenieClassifier
model = PipegenieClassifier(
    grammar=grammar_file,
    generations=5,
    pop_size=50,
    elite_size=5,
    n_jobs=5,
    seed=42,
    outdir="sample-results",
)

# Fit the classifier
model.fit(X_train, y_train)

# Predict using the classifier
y_pred = model.predict(X_test)

# Evaluate the classifier using the default scoring method
print(f"Model score: {model.score(X_test, y_test)}")

# Evaluate the classifier using another metric
print(f"Balanced accuracy: {balanced_accuracy_score(y_test, y_pred)}")

You can also check the following Jupyter notebook for a more detailed example: Classification Example.

Regression Example

You can try the following code to generate a model for a regression task using the PipegenieRegressor class:

from pipegenie.regression import PipegenieRegressor
from pipegenie.model_selection import train_test_split
from pipegenie.metrics import root_mean_squared_error
from sklearn.datasets import load_diabetes

# Load the dataset
X, y = load_diabetes(return_X_y=True)

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=9)

# Define the grammar file
grammar_file = "tutorials/sample-grammar-regression.xml"

# Create an instance of the PipegenieRegressor
model = PipegenieRegressor(
    grammar=grammar_file,
    generations=5,
    pop_size=50,
    elite_size=5,
    maximization=False,
    n_jobs=5,
    seed=9,
    outdir="sample-results",
)

# Fit the regressor
model.fit(X_train, y_train)

# Predict using the regressor
y_pred = model.predict(X_test)

# Evaluate the regressor using the default scoring method
print(f"Model score: {model.score(X_test, y_test)}")

# Evaluate the regressor using another metric
print(f"Root mean squared error: {root_mean_squared_error(y_test, y_pred)}")

You can also check the following Jupyter notebook for a more detailed example: Regression Example.

Saving and Loading Models

Once you have trained a model, you may want to save it to disk for later use. To save a model, you can use the provided save_model method:

model.save_model("model.pkl")

This method will save the internal ensemble created after the evolutionary process.

In case you want to save the model manually, you can use the following code:

import pickle

with open("model.pkl", "wb") as file:
    pickle.dump(model.ensemble, file)

To load a model, you can use the pickle module:

import pickle

with open("model.pkl", "rb") as file:
    ensemble = pickle.load(file)

PipeGenie's Internals

Grammar Definition

To be able to use PipeGenie, first you need to define the grammar that will be used to generate the pipelines. The grammar is a XML file that contains the rules that will be used to generate the pipelines. The grammar file should be structured as follows:

  • <grammar> </grammar> : The root element of the file. All components of the grammar must be within these tags.

  • <root-symbol> </root-symbol> : The initial symbol of the grammar. Derivations start from this symbol to create an algorithm flow and its corresponding hyperparameters. Should be written within the <grammar> tags.

  • <non-terminals> </non-terminals> : This tag contains all the production rules tags that can be used by the evolutionary process to generate a pipeline. Should be written within the <grammar> tags.

  • <non-terminal> </non-terminal> : Specifies a context-free grammar non terminal symbol. Should be written within the <non-terminals> tags. Its attribute is:

    • name : Name of the antecedent of the production rule.
  • <production-rule> </production-rule> : Specifies the consequent of a production rule. Terminals and non-terminals can be used in the production rule and should be separated by a ;. The symbols of the consequent have to be written with the same name as the <terminal> and <non-terminal> tags. To specify that an algorithm has hyperparameters, either terminals (list of hyperparameter <terminal> tags separated by a ;) or a non-terminal (name of the non-terminal tag that contains the hyperparameters) should be used and written between parentheses (e.g., RandomForestClassifier(rf::n_estimators;rf::max_depth) or RandomForestClassifier(rf_hyperparameters)). Should be written within the <non-terminal> tags.

  • <terminals> </terminals> : This tag contains both algorithms and hyperparameters tags that can be used by the evolutionary process to generate a pipeline. Should be written within the <grammar> tags.

  • <terminal> </terminal> : Used to specify a terminal symbol of the grammar. It can be an algorithm or a hyperparameter. Should be written within the <terminals> tags. To define an algorithm, the attributes are:

    • name : Identifier name for the algorithm for string representation.
    • type : Category of the algorithm for grouping purposes. This string can contain any value, allowing grouping of terminals with the same string value. Optional.
    • code : Package and name of the algorithm to be imported to use in the evolutionary process (e.g., sklearn.ensemble.RandomForestClassifier). It can be any algorithm compatible with scikit-learn API. Otherwise, a custom wrapper should be created to adapt the API.

    To define a hyperparameter, the attributes are:

    • name : Identifier name for the hyperparameter. Should be written as algorithm_id::hyperparameter_name (e.g., rf::n_estimators) and should be unique for each hyperparameter. algorithm_id can be any string that identifies the algorithm to which the hyperparameter belongs. It is used in order to avoid conflicts between hyperparameters with the same name in different algorithms. hyperparameter_name should be the same as the hyperparameter name in the algorithm's constructor.
    • type : Data type of the hyperparameter (int, float, categorical, bool, estimator, tuple or union). For a fix value, use the prefix fix_.
    • lower : Lower limit for numerical hyperparameters.
    • upper : Upper limit for numerical hyperparameters.
    • log : Specifies whether the hyperparameter should be sampled in log scale. Only used for numerical hyperparameters.
    • step : Specifies the step when sampling the hyperparameter. Only used for numerical hyperparameters.
    • values : Specifies the possible values for categorical hyperparameters. Only used for categorical hyperparameters.
    • estimators : Points to a <non-terminal> tag that contains the possible algorithms that can be used as hyperparameters. Only used for estimator hyperparameters.
    • value : Specifies the value for fixed hyperparameters. Only used for fixed hyperparameters.
    • default : Default value for the hyperparameter.
  • <tuple> </tuple> : Specifies a tuple of values for a hyperparameter (e.g., MLPClassifier::hidden_layer_sizes). Should be written within a <terminal> tag whose type is tuple. Its attributes are the same as the <terminal> tag to define hyperparameters, except for the name attribute.

  • <union> </union> : Specifies a union of value types that can be used for a hyperparameter (e.g. RandomForestClassifier::max_features can be either a categorical, integer or float hyperparameter). Should be written within a <terminal> tag whose type is union. Its attributes are the same as the <terminal> tag to define hyperparameters, except for the name attribute.

A full example of grammar files can be found in the tutorials folder.

To modify the grammar, you can create your own grammar file or modify the existing ones. An explanation of how to change the grammar file can be found here.

Outputs

After running the code, you will find a folder named sample-results (or the name you specified in the outdir parameter) containing the following files:

  • config.txt: Contains the configuration (parameters passed to the constructor) used for the evolutionary process. An example of this file can be found here.
  • best_pipeline.txt: Contains the best pipeline found by the evolutionary process, its fitness value and its predictions. An example of this file can be found here.
  • ensemble.txt: Contains the ensemble of pipelines found by the evolutionary process, their individual fitness values, the global fitness value, the weights of the pipelines in the ensemble and the predictions of the ensemble. An example of this file can be found here.
  • evolution.txt: Contains the evolution of the population during the evolutionary process. For each generation, it shows the number of evaluated individuals, the minimum, maximum, average and standard deviation of the fitness values of the population, fitness values of the elite, pipeline sizes of the population and pipeline sizes of the elite. At the end of the process, it will show the time in seconds that the process took. An example of this file can be found here.
  • individuals.tsv: Contains all the individuals evaluated during the evolutionary process. For each individual, it shows the pipeline, its fitness value and the time it took to evaluate it. If the individual was invalid, it will show the reason. An example of this file can be found here.

For more information about the outputs, you can check the Understanding PipeGenie Outputs tutorial.

FAQ

Q: What is grammar-guided genetic programming (G3P)?

A: G3P is a form of genetic programming that uses a grammar to guide the evolution of programs, ensuring they adhere to predefined structures and rules.

Q: Can I use PipeGenie with other machine learning libraries?

A: PipeGenie is designed to work with scikit-learn, so any library that is API-compatible with scikit-learn should work. If you want to use PipeGenie with a different library, you may need to make wrapper classes to adapt the API.

Q: Why the save_model method only saves the ensemble and not the Pipegenie object?

A: The ensemble is the most important part of the Pipegenie object, as it is responsible for making predictions. The Pipegenie object is only used to create and train the ensemble, so it is not necessary to save it. Also, due to limitations in the pickle module, it is not possible to save the Pipegenie object. In case you want to save the Pipegenie object (save the ensemble as well as the loaded grammar and configuration), you should use other serialization libraries like dill or cloudpickle.

Contributors

  • Rafael Barbudo Lunar

  • Mario Berrios Carmona

  • Carlos García Martínez

  • Vincent Gardies

  • Ángel Fuentes Almoguera

  • Aurora Ramírez Quesada

  • José Raúl Romero Salguero

Citing PipeGenie

If you are using PipeGenie in your research or project, please cite using the following BibTeX entry:

@article{BARBUDO2024111292,
    title   = {Grammar-based evolutionary approach for automated workflow composition with domain-specific operators and ensemble diversity},
    journal = {Applied Soft Computing},
    volume  = {153},
    pages   = {111292},
    year    = {2024},
    issn    = {1568-4946},
    doi     = {https://doi.org/10.1016/j.asoc.2024.111292},
    url     = {https://www.sciencedirect.com/science/article/pii/S1568494624000668},
    author  = {Rafael Barbudo and Aurora Ramírez and José Raúl Romero},
}

License

This project is licensed under the MIT License - see the LICENSE file for details.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pipegenie-0.7.0.tar.gz (70.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pipegenie-0.7.0-py3-none-any.whl (75.6 kB view details)

Uploaded Python 3

File details

Details for the file pipegenie-0.7.0.tar.gz.

File metadata

  • Download URL: pipegenie-0.7.0.tar.gz
  • Upload date:
  • Size: 70.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for pipegenie-0.7.0.tar.gz
Algorithm Hash digest
SHA256 4a4163350bd27c58c4aa66d2ddf8b0954092cf9bfed8573e55404c16a3406179
MD5 361b4b68b8bbabcf32dc198b98b2c916
BLAKE2b-256 96e0c0fa2a4881d07ccfb8695646179b16cf4d9032bbefb1bc8ceb60432cda5a

See more details on using hashes here.

File details

Details for the file pipegenie-0.7.0-py3-none-any.whl.

File metadata

  • Download URL: pipegenie-0.7.0-py3-none-any.whl
  • Upload date:
  • Size: 75.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for pipegenie-0.7.0-py3-none-any.whl
Algorithm Hash digest
SHA256 dbe3e0663de89326541cd88b5ee89d260ea0a5e04747e84a90c3e40699012330
MD5 3b81d12c7161a77b90fe808c741c5023
BLAKE2b-256 d9128ecf2c1a3773ee52db63f941611997036b9eebc72b8255f801c68e8fb4fb

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page